{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example usage\n", "\n", "To use `linreg_ally` in a project:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.1.0\n" ] } ], "source": [ "import linreg_ally\n", "\n", "print(linreg_ally.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Imports\n", "from vega_datasets import data\n", "from linreg_ally.eda import eda_summary\n", "from linreg_ally.multicollinearity import check_multicollinearity\n", "from linreg_ally.models import run_linear_regression\n", "from linreg_ally.plotting import qq_and_residuals_plot\n", "from sklearn.model_selection import train_test_split " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis (EDA)\n", "\n", "Since we are using the `cars` dataset from the Vega datasets package, it will be helpful to see whether there is a difference among the distributions of the various numerical features when looking at the origin of the car. Such an EDA can easily be achieved using the function `eda_summary` from `linreg_ally.eda`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Function usage" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.ConcatChart(...)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Load data\n", "cars = data.cars()\n", "\n", "# Run EDA by subsetting on origin\n", "eda_plot = eda_summary(cars, color='Origin')\n", "\n", "# Show the EDA plot\n", "eda_plot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the EDA plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot produced by `eda_summary` shows clear differences in the distribution of gas mileage across the three regions of origin. The average gas mileage in miles per gallon (MPG) is lowest for US cars and highest for Japanese cars. The distributions of horsepower and displacement also look similar, which suggests that the two features may be associated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check Multicollinearity\n", "\n", "Multicollinearity exists when two or more explanatory variables in a regression model are correlated. A high degree of multicollinearity is problematic because it can make coefficient estimates unstable. When two explanatory variables are perfectly correlated, the regression can even fail, because it becomes impossible to assess how the target variable responds to a unit change in one explanatory variable while holding all other explanatory variables constant. \n", "\n", "Multicollinearity can be checked through the Variance Inflation Factor (\"VIF\"). The VIF of a given explanatory variable $X_i$ is computed as $\\frac{1}{1 - R^2_i}$, where $R^2_i$ is the coefficient of determination from regressing $X_i$ against all other explanatory variables. 
Typically, VIFs between 1 and 5 suggest moderate correlation that is not severe enough to warrant corrective measures. However, VIFs greater than 5 represent critical levels of multicollinearity where the coefficients will be poorly estimated. \n", "\n", "Another way of detecting multicollinearity is to obtain the pairwise correlation between explanatory variables. A correlation close to -1 or 1 suggests severe multicollinearity. \n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Load the data and drop the car name, which is an identifier rather than a feature\n", "cars = data.cars()\n", "cars = cars.drop(columns=['Name'])\n", "\n", "# Hold out 20% of the rows as a test set\n", "train_df, test_df = train_test_split(cars, test_size=0.2, random_state=123)\n", "# The first column (Miles_per_Gallon) is the target; the rest are explanatory variables\n", "X_train = train_df.iloc[:, 1:]\n", "y_train = train_df.iloc[:, 0:1]\n", "\n", "# Drop rows with missing values before computing VIFs\n", "X_train_clean = X_train.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `check_multicollinearity()` function provides a simple and convenient way to detect multicollinearity in your dataset. Simply pass the cleaned training dataframe into the function and it will return the VIF for every numeric explanatory variable. You can also set the optional argument `threshold` so that the function returns only the explanatory variables whose VIF exceeds the threshold. \n", "\n", "In the example below with the `cars` dataset, we want to fit a linear regression to understand which characteristics of a car affect its fuel efficiency. To verify whether multicollinearity exists in our explanatory variables ('Cylinders', 'Displacement', 'Horsepower', 'Weight_in_lbs', 'Acceleration'), we pass our training dataset into `check_multicollinearity()`. The output dataframe suggests our explanatory variables are highly correlated with each other, in particular `Weight_in_lbs` with a VIF of approximately 127.0. In this case, it means differences in `Weight_in_lbs` between car makes can be well explained by other variables with high VIFs such as `Cylinders` and `Displacement`. 
\n", "\n", "It may be beneficial to revisit the research problem at hand and determine whether `Weight_in_lbs` should be included in our model. Domain knowledge will be particularly useful in guiding this decision." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Features VIF\n", "0 Cylinders 96.549975\n", "1 Displacement 71.962976\n", "2 Horsepower 44.626008\n", "3 Weight_in_lbs 126.971443\n", "4 Acceleration 26.482060" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vif_df = check_multicollinearity(X_train_clean, vif_only=True)\n", "vif_df" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Features VIF\n", "0 Cylinders 96.549975\n", "1 Displacement 71.962976\n", "3 Weight_in_lbs 126.971443" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vif_df_w_threshold = check_multicollinearity(X_train_clean, threshold=50, vif_only=True)\n", "vif_df_w_threshold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`check_multicollinearity()` also allow users to obtain pairwise Pearson Correlation for the explanatory variables in the training set. To do so, simply set the argument `vif_only` as `FALSE`. \n", "\n", "Back to our example with the `cars` dataset, the pairwise Pearson Correlations seem to agree with our VIF analysis that the explanatory variables are highly correlated with each other. Considerations should go into whether some of these variables should be dropped to ensure robustness of the regression model." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vif_df, corr_chart = check_multicollinearity(X_train_clean, vif_only=False)\n", "corr_chart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running Linear Regression Tutorial\n", "\n", "In this tutorial, you will learn a streamlined way to preprocess data, run linear regression and report scoring metrics.\n", "\n", "First, ensure that `run_linear_regression` is imported from the `linreg_ally.models` module." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from linreg_ally.models import run_linear_regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be using the `cars` dataset provided by `vega_datasets`. This dataset contains various features related to cars, including both numerical and categorical variables, which makes it well suited to demonstrating the full capabilities of our linear regression function." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "0 chevrolet chevelle malibu 18.0 8 307.0 \n", "1 buick skylark 320 15.0 8 350.0 \n", "2 plymouth satellite 18.0 8 318.0 \n", "3 amc rebel sst 16.0 8 304.0 \n", "4 ford torino 17.0 8 302.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "0 130.0 3504 12.0 1970-01-01 USA \n", "1 165.0 3693 11.5 1970-01-01 USA \n", "2 150.0 3436 11.0 1970-01-01 USA \n", "3 150.0 3433 12.0 1970-01-01 USA \n", "4 140.0 3449 10.5 1970-01-01 USA " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = data.cars()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown above, the dataset includes data about different car models, featuring attributes such as `Miles_per_Gallon`, `Cylinders`, `Displacement` etc. We will utilize these attributes to build a linear regression model, predicting the target variable `Horsepower`.\n", "\n", "We will first perform some data cleaning by removing columns that contain `NA` values." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df = df[['Horsepower', 'Displacement']].dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the dataset loaded, you're all set to move forward to the next step: using our package's `run_linear_regression` function to prepare the data, fit a model, and evaluate its performance.\n", "\n", "We will specify the `target_column`, `numeric_feats`, `categorical_feats` and `drop_feats`. In this case, `target_column` will be `Horsepower` since we are trying to predict its value. `numeric_feats` will be all the numeric features that we want to scale using scikit-learn's `StandardScaler`. `categorical_feats` will be the categorical features (in this case only `Origin`) that we want to perform one-hot encoding on using scikit-learn's `OneHotEncoder`. 
`drop_feats` will be the columns that we do not want to include in the analysis, which in this case is `Name`, since it does not provide any meaningful information for the analysis.\n", "\n", "For the `scoring_metrics`, we will specify `r2` to evaluate the performance of the model on test data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Summary\n", "------------------------\n", "Test r2: 0.846\n" ] } ], "source": [ "df = data.cars()\n", "df = df.dropna()\n", "\n", "# Define parameters for run_linear_regression\n", "target_column = \"Horsepower\"\n", "numeric_feats = [\"Miles_per_Gallon\", \"Cylinders\", \"Displacement\", \"Weight_in_lbs\", \"Acceleration\"]\n", "categorical_feats = [\"Origin\"]\n", "drop_feats = [\"Name\"]\n", "random_state = 123\n", "scoring_metrics = [\"r2\"]\n", "\n", "best_model, X_train, X_test, y_train, y_test, scores = run_linear_regression(\n", "    dataframe=df,\n", "    target_column=target_column,\n", "    numeric_feats=numeric_feats,\n", "    categorical_feats=categorical_feats,\n", "    drop_feats=drop_feats,\n", "    random_state=random_state,\n", "    scoring_metrics=scoring_metrics\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Displaying `best_model` gives a visual summary of the steps used in the entire linear regression pipeline." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(steps=[('preprocessor',\n", " ColumnTransformer(transformers=[('standardscaler',\n", " StandardScaler(),\n", " ['Miles_per_Gallon',\n", " 'Cylinders', 'Displacement',\n", " 'Weight_in_lbs',\n", " 'Acceleration']),\n", " ('onehotencoder',\n", " OneHotEncoder(), ['Origin']),\n", " ('drop', 'drop', ['Name'])])),\n", " ('model', LinearRegression())])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "best_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scores give the R² and negative mean squared error scores that we are interested in finding out in order to understand how the model performs on the test data." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'r2': 0.8463952369304465}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown above, an R² score of 85% indicates that 85% of the variance in the dependent variable can be explained by the independent variables included in the model, showing that the model provides a good fit to the data.\n", "\n", "However, R² alone does not tell the whole story, for example if there might be multicollinearity or other issues. You might also want to consider other metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or visually inspect residual plots to gain a more comprehensive understanding of model performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the end of this tutorial where you have seen how we use the `run_linear_regression` function in our package to preprocess data, run linear regression and output with scoring metrics." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking Normality and Homoscedasticity of Residuals\n", "\n", "A linear regression model assumes that residuals are normally distributed and have constant variance (homoscedasticity). To check whether these assumptions are met, we use the `qq_and_residuals_plot` function. This function generates:\n", "\n", "1. A Quantile-Quantile (Q-Q) plot to assess the normality of residuals.\n", "2. A Residuals vs. Fitted Values plot to check for homoscedasticity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `qq_and_residuals_plot` function takes two parameters: `y_actual` and `y_predicted`. Both can be obtained from the linear regression model we previously created." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# y_actual is y_test (true target values)\n", "y_actual = y_test\n", "\n", "# y_predicted is obtained by predicting on X_test\n", "y_predicted = best_model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that `y_actual` and `y_predicted` have been extracted, let's pass them as arguments to the `qq_and_residuals_plot` function." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "qq_and_residuals_plot(y_actual, y_predicted)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the Q-Q Plot\n", "\n", "If the points in the Q-Q plot deviate substantially from the red dashed line (which represents perfect normality), the residuals are not normally distributed. In our plot, a few points deviate from the line at the tails, suggesting potential skewness or outliers. However, since these deviations are minor, we can conclude that the residuals are approximately normal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the Residuals vs. Fitted Values Plot\n", "\n", "For the homoscedasticity assumption to hold, residuals should be randomly scattered around the red dashed line in the Residuals vs. Fitted Values plot. This would indicate that residual variance remains constant across all fitted values (homoscedasticity).\n", "\n", "However, in our case, the residuals cluster at different fitted value ranges, and the spread increases as the fitted values increase, suggesting that the variance is not constant (heteroscedasticity)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implications of Assumption Violations\n", "\n", "If the normality assumption is violated:\n", "Ordinary Least Squares (OLS) regression still produces the best linear unbiased estimator (BLUE) as long as independence and homoscedasticity hold. However, hypothesis tests and confidence intervals may be misleading if residuals deviate significantly from normality.\n", "\n", "If the homoscedasticity assumption is violated:\n", "You can still fit a linear regression model, but you should interpret results with caution. The estimated coefficients remain unbiased, but they are no longer the most efficient, and standard errors and p-values become unreliable, affecting statistical inference." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "The `qq_and_residuals_plot` function is a valuable tool for assessing the normality and homoscedasticity assumptions in linear regression. If these assumptions are violated, you should consider corrective measures such as:\n", "\n", "- Transforming variables (e.g., logarithmic transformation),\n", "- Using robust standard errors, or\n", "- Exploring alternative models (e.g., weighted least squares, generalized least squares)." ] } ], "metadata": { "kernelspec": { "display_name": "linreg", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }