{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example usage\n", "\n", "To use `linreg_ally` in a project:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.1.0\n" ] } ], "source": [ "import linreg_ally\n", "\n", "print(linreg_ally.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Imports\n", "from vega_datasets import data\n", "from linreg_ally.eda import eda_summary\n", "from linreg_ally.multicollinearity import check_multicollinearity\n", "from linreg_ally.models import run_linear_regression\n", "from linreg_ally.plotting import qq_and_residuals_plot\n", "from sklearn.model_selection import train_test_split " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis (EDA)\n", "\n", "Since we are using the `cars` dataset from the Vega datasets package, it will be helpful to see whether there is a difference among the distributions of the various numerical features when looking at the origin of the car. Such an EDA can easily be achieved using the function `eda_summary` from `linreg_ally.eda`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Function usage" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.ConcatChart(...)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Load data\n", "cars = data.cars()\n", "\n", "# Run EDA by subsetting on origin\n", "eda_plot = eda_summary(cars, color='Origin')\n", "\n", "# Show the EDA plot\n", "eda_plot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the EDA plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot produced by `eda_summary` shows clear differences in the distribution of gas mileage among cars from the three regions of interest. US cars have the lowest average gas mileage in miles per gallon (MPG), while Japanese cars have the highest. The distributions of horsepower and displacement also look similar, which suggests that the two features may be associated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check Multicollinearity\n", "\n", "Multicollinearity exists when two or more explanatory variables in a regression model are correlated. A high degree of multicollinearity is problematic because it can make coefficient estimates unstable. When two explanatory variables are perfectly correlated, the regression can even fail, because it becomes impossible to assess how the target variable is affected by a unit change in one explanatory variable while holding all other explanatory variables constant. \n", "\n", "Multicollinearity can be checked with the Variance Inflation Factor (\"VIF\"). The VIF of a given explanatory variable $X_i$ is $\\mathrm{VIF}_i = \\frac{1}{1 - R_i^2}$, where $R_i^2$ is the coefficient of determination from regressing $X_i$ on all other explanatory variables. 
A VIF of 1 means that $X_i$ is uncorrelated with the other explanatory variables. As a common rule of thumb, VIFs between 1 and 5 suggest moderate correlation that is not severe enough to warrant corrective measures, whereas VIFs greater than 5 indicate critical levels of multicollinearity at which the coefficients are likely to be poorly estimated. \n", "\n", "Another way to detect multicollinearity is to compute the pairwise correlations between explanatory variables. A correlation close to -1 or 1 suggests severe multicollinearity. \n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Load the data and drop the free-text name column\n", "cars = data.cars()\n", "cars = cars.drop(columns=['Name'])\n", "\n", "# Hold out 20% of the rows for testing\n", "train_df, test_df = train_test_split(cars, test_size=0.2, random_state=123)\n", "\n", "# The first column (Miles_per_Gallon) is the target; the rest are explanatory variables\n", "X_train = train_df.iloc[:, 1:]\n", "y_train = train_df.iloc[:, 0:1]\n", "\n", "# Drop rows with missing values before computing VIFs\n", "X_train_clean = X_train.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `check_multicollinearity()` function provides a convenient way to detect multicollinearity in your dataset. Pass the cleaned training dataframe to the function and it returns the VIF of every numeric explanatory variable. You can also set the optional argument `threshold` so that the function returns only the explanatory variables whose VIF exceeds the threshold. \n", "\n", "In the example below with the `cars` dataset, we want to fit a linear regression to understand which characteristics of a car affect its fuel efficiency. To check for multicollinearity among our explanatory variables ('Cylinders', 'Displacement', 'Horsepower', 'Weight_in_lbs', 'Acceleration'), we pass our training dataset into `check_multicollinearity()`. The output dataframe suggests that our explanatory variables are highly correlated with each other, in particular `Weight_in_lbs`, which has a VIF of about 127.0. This means that differences in `Weight_in_lbs` between cars can be well explained by the other variables with high VIFs, such as `Cylinders` and `Displacement`. 
\n", "\n", "It may be beneficial to revisit the research problem in hand and determine whether `Weight_in_lbs` should be included in our model. Domain knowledge will be particularly useful in guiding this decision." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [
|  | Features | VIF |
|---|---|---|
| 0 | Cylinders | 96.549975 |
| 1 | Displacement | 71.962976 |
| 2 | Horsepower | 44.626008 |
| 3 | Weight_in_lbs | 126.971443 |
| 4 | Acceleration | 26.482060 |
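
To make the VIF definition concrete, here is a minimal NumPy sketch that regresses each column on all the others (with an intercept) and applies 1 / (1 - R²). It is an illustration on synthetic data only: the `vif` helper, the column names, and the threshold of 5 are assumptions made for this example, and it is not necessarily how `check_multicollinearity()` computes its values, so the numbers need not match the table above.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of a 2-D array X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        # Regress column i on all other columns plus an intercept.
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic data (hypothetical): x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.05, size=200)
x4 = rng.normal(size=200)

names = ["x1", "x2", "x3", "x4"]
vifs = vif(np.column_stack([x1, x2, x3, x4]))
print(dict(zip(names, vifs.round(2))))

# Keep only the columns whose VIF exceeds a chosen threshold.
threshold = 5.0
flagged = [name for name, value in zip(names, vifs) if value > threshold]
print(flagged)
```

The three entangled columns should be flagged while the independent `x4` stays close to 1. Constant or perfectly collinear columns would need special handling, since the denominator becomes zero.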
|  | Features | VIF |
|---|---|---|
| 0 | Cylinders | 96.549975 |
| 1 | Displacement | 71.962976 |
| 3 | Weight_in_lbs | 126.971443 |
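
The pairwise-correlation check mentioned in this section can be sketched in a similar way. The example below uses synthetic data and `np.corrcoef`; the feature names and the cutoff of 0.9 are hypothetical choices for illustration, not part of `linreg_ally`.

```python
import numpy as np

# Synthetic features (hypothetical): "weight" is nearly a multiple of "size".
rng = np.random.default_rng(1)
size = rng.normal(size=300)
features = {
    "size": size,
    "weight": 2.0 * size + rng.normal(scale=0.1, size=300),
    "noise": rng.normal(size=300),
}
names = list(features)

# Correlation matrix with one row and one column per feature.
corr = np.corrcoef(np.column_stack([features[n] for n in names]), rowvar=False)

# Report each pair whose absolute correlation exceeds the cutoff.
cutoff = 0.9
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > cutoff
]
print(high_pairs)
```

Pairwise correlations only reveal relationships between two variables at a time, so a variable that is well explained by a combination of several others can pass this check while still having a high VIF.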
|  | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
| 1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
| 2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
| 3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
| 4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['Miles_per_Gallon',
                                                   'Cylinders', 'Displacement',
                                                   'Weight_in_lbs',
                                                   'Acceleration']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(), ['Origin']),
                                                 ('drop', 'drop', ['Name'])])),
                ('model', LinearRegression())])