Example usage

To use linreg_ally in a project:

import linreg_ally

print(linreg_ally.__version__)

0.0.1

# Imports
from vega_datasets import data
from linreg_ally.eda import eda_summary
from linreg_ally.multicollinearity import check_multicollinearity
from linreg_ally.models import run_linear_regression
from linreg_ally.plotting import qq_and_residuals_plot
from sklearn.model_selection import train_test_split 

Exploratory Data Analysis (EDA)

Since we are using the cars dataset from the Vega datasets package, it will be helpful to see whether there is a difference among the distributions of the various numerical features when looking at the origin of the car. Such an EDA can easily be achieved using the function eda_summary from linreg_ally.eda.

Function usage

# Load data
cars = data.cars()

# Run EDA by subsetting on origin
eda_plot = eda_summary(cars, color='Origin')

# Show the EDA plot
eda_plot.show()

Interpreting the EDA plot

Using the eda_summary function and observing the plot, there is clear evidence that there is a difference among the distributions of the car mileage for cars from the three regions of interest. Here, it is evident that the average gas mileage in miles per gallon (MPG) for US cars is the lowest among the different cars with Japan having the best average gas mileage. Similarly, it can be observed that there might possibly be some association between the horsepower and the displacement due to how similar the distributions look between the two features.

Check Multicollinearity

Multicollinearity exists when two or more explantory variables in a regression model are correlated. High degree of multicolinearity in a regression model is problematic as it can make coefficient estimates unstable. In the case where there is perfect correlation between two explanatory variables, it can even cause a regression model to fail as it will be impossible to assess how the target variable is affected by a unit change in an explantory variable when holding all other explantory variables constant.

Multicollinearity can be checked through the Variance Inflator Factors (“VIF”). VIF of a given explanatory variable $X_i$ is computed by $\frac{1}{(1 - R^2_{X_i,\dots,X_{i-1}})}$ where $R^2_{X_i,\dots,X_{i-1}}$ is the coefficient of determination from regressing $X_i$ against all other explanatory variables. Typically VIFs between 1 and 5 suggest that there is moderate correlation, but it is not severe enough to warrant any corrective measures. However, VIFs greater than 5 represent critical levels of multicollinearity where the coefficients will be poorly estimated.

Another way of detecting multicollinearity is to obtain the pairwise correlation between explanatory variables. Correlation close to -1 or 1 suggests severe multicollinearity.

cars = data.cars()
cars = cars.drop(columns=['Name'])

train_df, test_df = train_test_split(cars, test_size=0.2, random_state=123)
X_train = train_df.iloc[:,1:]
y_train = train_df.iloc[:,0:1]

X_train_clean = X_train.dropna()

The check_multicollinearity() function provides a simple and convenient way to detect multicollinearity in your dataset. Simply pass the cleaned training dataframe into the function and it will return the VIF for every numeric explanatory variable. You can also set the optional argument threshold such that the function returns only explanatory variables with a VIF beyond the set threshold.

In the example below with the cars dataset, we are hoping to fit a linear regression to understand what factors of a car can affect its fuel efficiency. To verify whether multicollinearity exists in our explanatory variables (‘Cylinders’, ‘Displacement’, ‘Horsepower’, ‘Weight_in_lbs’, ‘Acceleration’), we pass our training dataset into check_multicollinearity(). The output dataframe suggests our explanatory variables are highly correlated with each other, in particular Weights_in_lbs with a VIF of 126.9. In this case, it means differences in Weight_in_lbs between car makes can be well explained by other variables with high VIFs such as Cylinders and Displacement.

It may be beneficial to revisit the research problem in hand and determine whether Weight_in_lbs should be included in our model. Domain knowledge will be particularly useful in guiding this decision.

vif_df = check_multicollinearity(X_train_clean, vif_only=True)
vif_df

	Features	VIF
0	Cylinders	96.549975
1	Displacement	71.962976
2	Horsepower	44.626008
3	Weight_in_lbs	126.971443
4	Acceleration	26.482060

vif_df_w_threshold = check_multicollinearity(X_train_clean, threshold=50, vif_only=True)
vif_df_w_threshold

	Features	VIF
0	Cylinders	96.549975
1	Displacement	71.962976
3	Weight_in_lbs	126.971443

check_multicollinearity() also allow users to obtain pairwise Pearson Correlation for the explanatory variables in the training set. To do so, simply set the argument vif_only as FALSE.

Back to our example with the cars dataset, the pairwise Pearson Correlations seem to agree with our VIF analysis that the explanatory variables are highly correlated with each other. Considerations should go into whether some of these variables should be dropped to ensure robustness of the regression model.

vif_df, corr_chart = check_multicollinearity(X_train_clean, vif_only=False)
corr_chart

Running Linear Regression Tutorial

In this tutorial, you will learn a streamlined way to preprocess data, run linear regression and output with scoring metrics.

First, ensure you have the models package imported.

from linreg_ally.models import run_linear_regression

We will be using the cars dataset provided by vega_datasets. This dataset contains various features related to cars, including both numerical and categorical variables, making it ideal for demonstrating the full capabilities of our linear regression function.

df = data.cars()
df.head()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA

As shown above, the dataset includes data about different car models, featuring attributes such as Miles_per_Gallon, Cylinders, Displacement etc. We will utilize these attributes to build a linear regression model, predicting the target variable Horsepower.

We will first perform some data cleaning by removing columns that contain NA values.

df = df[['Horsepower', 'Displacement']].dropna()

With the dataset loaded, you’re all set to move forward to the next step: using our package’s run_linear_regression function to prepare the data, fit a model, and evaluate its performance.

We will specify the target_column, numeric_feats, categorical_feats and drop_feats. In this case, target_column will be Horsepower since we are trying to predict its value. numeric_feats will be all the numeric features that we want to scale using scikit-learn’s StandardScaler. categorical_feats will be the categorical features (in this case only Origin) that we want to perform one-hot encoding on using scikit-learn’s OneHotEncoder. drop_feats will be the columns that we do not want to include in the analysis, in which in this case will be Name since it does not provide any meaningful information to the analysis.

For the scoring_metrics, we will specify r2 to evaluate the performance of the model on test data.

df = data.cars()
df = df.dropna()

# Define parameters for run_linear_regression
target_column = "Horsepower"
numeric_feats = ["Miles_per_Gallon", "Cylinders", "Displacement", "Weight_in_lbs", "Acceleration"] 
categorical_feats = ["Origin"]
drop_feats = ["Name"]
random_state = 123
scoring_metrics = ["r2"]

best_model, X_train, X_test, y_train, y_test, scores = run_linear_regression(
    dataframe=df,
    target_column=target_column,
    numeric_feats=numeric_feats,
    categorical_feats=categorical_feats,
    drop_feats=drop_feats,
    random_state=random_state,
    scoring_metrics=scoring_metrics
)

Model Summary
------------------------
Test r2: 0.846

best_model provides a visual summary of the steps used in the entire linear regression pipeline.

best_model

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['Miles_per_Gallon',
                                                   'Cylinders', 'Displacement',
                                                   'Weight_in_lbs',
                                                   'Acceleration']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(), ['Origin']),
                                                 ('drop', 'drop', ['Name'])])),
                ('model', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Scores give the R² and negative mean squared error scores that we are interested in finding out in order to understand how the model performs on the test data.

scores

{'r2': 0.8463952369304465}

As shown above, an R² score of 85% indicates that 85% of the variance in the dependent variable can be explained by the independent variables included in the model, showing that the model provides a good fit to the data.

However, R² alone does not tell the whole story, for example if there might be multicollinearity or other issues. You might also want to consider other metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or visually inspect residual plots to gain a more comprehensive understanding of model performance.

This is the end of this tutorial where you have seen how we use the run_linear_regression function in our package to preprocess data, run linear regression and output with scoring metrics.

Checking Normality and Homoscedasticity of Residuals

A linear regression model assumes that residuals are normally distributed and have constant variance (homoscedasticity). To check whether these assumptions are met, we use the qq_and_residuals_plot function. This function generates:

A Quantile-Quantile (Q-Q) plot to assess the normality of residuals.
A Residuals vs. Fitted Values plot to check for homoscedasticity.

The qq_and_residuals_plot function takes two parameters: y_actual and y_predicted. These values were extracted from the linear regression model we previously created.

# y_actual is y_test (true labels)
y_actual = y_test

# y_predicted is obtained by predicting on X_test
y_predicted = best_model.predict(X_test)

Now that y_actual and y_predicted have been extracted, let’s pass these parameters to the qq_and_residuals_plot function.

qq_and_residuals_plot(y_actual, y_predicted)

Interpreting the Q-Q Plot

If the Q-Q plot shows a significant deviation from the red dashed line (which represents perfect normality), the residuals are not normally distributed. In our plot, a few points deviate from the line at the tails, suggesting potential skewness or outliers. However, since these deviations are minor, we can conclude that the residuals are approximately normal.

Interpreting the Residuals vs. Fitted Values Plot

For the homoscedasticity assumption to hold, residuals should be randomly scattered around the red dashed line in the Residuals vs. Fitted Values plot. This would indicate that residual variance remains constant across all fitted values (homoscedasticity).

However, in our case, the residuals cluster at different fitted value ranges, and the spread increases as the fitted values increase, suggesting that the variance is not constant (heteroscedasticity).

Implications of Assumption Violations

If the normality assumption is violated: Ordinary Least Squares (OLS) regression still produces best linear unbiased estimates (BLUE) as long as independence and homoscedasticity hold. However, hypothesis tests and confidence intervals may be misleading if residuals deviate significantly from normality.

If the homoscedasticity assumption is violated: You can still fit a linear regression model, but you should interpret results with caution. The estimated coefficients remain unbiased, but standard errors and p-values become unreliable, affecting statistical inference.

Conclusion

The qq_and_residuals_plot function is a valuable tool for assessing the normality and homoscedasticity assumptions in linear regression. If these assumptions are violated, you should consider corrective measures such as:

Transforming variables (e.g., logarithmic transformation),
Using robust standard errors, or
Exploring alternative models (e.g., weighted least squares, generalized least squares).