# Identifying and Addressing Multicollinearity in Regression Analysis

# What is Multicollinearity?

**Multicollinearity** is a condition that can arise in regression analysis when two or more independent variables are highly correlated with one another. While a regression model generally benefits from strong correlation between the predictors and the dependent variable, strong correlations among the independent variables themselves can harm model explainability and inflate the standard errors of the predictor coefficients.

# Multicollinearity vs. Collinearity

Before continuing, we need to differentiate between two terms: **collinearity** and **multicollinearity**. **Collinearity** refers to a linear relationship between two explanatory variables. **Multicollinearity** refers to the condition where two or more explanatory variables exhibit a high level of collinearity. Collinearity is a general term for linear relationships, which can be strong or weak, while multicollinearity is a more specific term describing a scenario where strong collinearity among at least two predictors is present.

# Why is Multicollinearity a Problem?

For regression models, the impact of multicollinearity on performance varies with the nature of the dataset and the model's application. Multicollinearity primarily causes problems when retraining a model on new data or when trying to explain the model using its regression coefficients. When two or more predictors are strongly collinear, the regression algorithm struggles to assign an accurate coefficient to each one because the predictors are difficult to distinguish from one another numerically.

Multicollinearity indicates an overlap in the explanatory information provided by two or more predictors. Because these predictors carry redundant information, the regression algorithm has difficulty determining how much weight to assign to each. As a result, slightly different training scenarios can produce wildly different coefficients, potentially even flipping their signs between positive and negative. This instability is reflected in the coefficient standard errors, which are higher for predictors that demonstrate multicollinearity than for those that do not.
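This instability is easy to reproduce. The sketch below (synthetic data, not one of the article's examples) refits OLS on bootstrap resamples of two nearly identical predictors and compares the spread of the individual coefficients to the spread of their sum:

```python
import numpy as np

# Two nearly identical predictors; the true model is y = x1 + x2 + noise.
rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a duplicate of x1
y = x1 + x2 + rng.normal(scale=1.0, size=n)
X = np.column_stack([x1, x2])

# Refit OLS on bootstrap resamples and collect the coefficients.
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, n)            # bootstrap resample
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    coefs.append(beta)
coefs = np.array(coefs)

# Individual coefficients swing wildly between resamples, while their
# sum (the quantity the data can actually identify) stays stable.
print(coefs[:, 0].std(), coefs[:, 1].std())
print(coefs.sum(axis=1).std())
```

The spread of each individual coefficient is an order of magnitude larger than the spread of their sum, which is exactly the behavior the coefficient standard errors report.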

While having dramatically different coefficients each time the model is retrained may not have a major impact on predictive performance, it introduces problems when trying to explain the importance of each predictor. The apparent importance of highly correlated predictors will change each time the model is retrained, making it difficult to draw meaningful real-world conclusions from the model. The presence of multicollinearity can also obscure the relationship between each individual predictor and the dependent variable, further decreasing the model's interpretability.

It is important to note that model interpretability can be significantly more important than model accuracy or predictive capability when building models in some client applications. For example, **marketing mix modeling** involves the analysis of regression models to determine the impact of various types of marketing spend as well as other exogenous variables on client KPIs such as sales, new customers, etc. For these applications, the client is typically more interested in the overall impact of each type of marketing spend (represented by regression coefficients) and how they can improve their resource allocation to maximize efficiency than the model’s predictive capacity.

# Identifying Multicollinearity

## 1. Correlation Matrix

A quick way to identify potential multicollinearity is to review the correlation matrix for the predictor variables. A correlation coefficient with an absolute value > 0.7 typically indicates a strong correlation between predictor variables, though this is just a rule of thumb. Removing some of the redundant, highly correlated predictors can reduce multicollinearity within your training data, improving the stability of the resulting model's coefficients across retrains. While a correlation matrix can help identify pairwise multicollinearity, it cannot detect higher-order multicollinearity that may exist within groups of predictors.

As an example, we can look at the correlations among independent variables included in the wine quality dataset, a commonly used dataset for regression exercises. For simplicity, we will look at white wine only. This dataset includes eleven quantitative physicochemical measurements associated with 4,898 white wine samples (independent variables) as well as a quality score between 0 and 10 assigned by human judges (dependent variable). A Pearson correlation matrix for the standardized independent variables is shown below:

Here we can see that a strong negative correlation (r = -0.78) exists between density and alcohol. Density also has moderate negative correlation (-0.5 ≤ r ≤ -0.3) with three other variables (residual_sugar, chlorides, and total_sulfur_dioxide). The presence of these correlations suggests that multicollinearity may be present among some predictors, but further analysis is needed.
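As a self-contained illustration of this check, the sketch below builds a small synthetic frame (column names borrowed from the wine data, but the values are generated, not real) and flags predictor pairs that exceed the |r| > 0.7 rule of thumb with pandas:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data; "density" is constructed to
# correlate negatively with "alcohol" (values are illustrative only).
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.5, 1.2, n)
residual_sugar = rng.normal(6.4, 5.0, n)
density = (1.0 - 0.004 * alcohol + 0.0005 * residual_sugar
           + rng.normal(0, 0.0005, n))
df = pd.DataFrame({
    "alcohol": alcohol,
    "residual_sugar": residual_sugar,
    "density": density,
})

# Pearson correlation matrix for the predictors
corr = df.corr()
print(corr.round(2))

# Flag predictor pairs whose |r| exceeds the 0.7 rule of thumb
# (the diagonal is excluded because |r| there is exactly 1)
strong = (corr.abs() > 0.7) & (corr.abs() < 1.0)
print(strong)
```

The alcohol/density pair is flagged, mirroring the strong negative correlation observed in the real dataset.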

## 2. Variance Inflation Factors (VIF)

Another standard approach to quantifying multicollinearity is to calculate the **Variance Inflation Factor (VIF)** for each independent variable included in the regression. VIF is calculated from the **tolerance** (Tol), as shown in the equations below:

Tol_j = 1 − R_j²

VIF_j = 1 / Tol_j = 1 / (1 − R_j²)

In these equations, R_j² represents the coefficient of determination for a regression model predicting the jth independent variable from the remaining predictors. A higher VIF indicates a greater degree of collinearity between an independent variable and the other predictors included in the model. As an informal rule of thumb, multicollinearity is commonly defined by a VIF > 5 (tolerance < 0.2), with a VIF > 10 indicating a high degree of multicollinearity.

Taking the square root of the VIF gives the factor by which the standard error of a predictor's coefficient is inflated relative to a scenario in which that predictor was completely uncorrelated with all other independent variables. For example, if an independent variable had a VIF of 25, its coefficient's standard error would be 5 times larger than it would be if the predictor were uncorrelated with the other predictors.
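A minimal VIF implementation, written from scratch with numpy rather than relying on the statsmodels helper, might look like this (synthetic data; the near-duplicate pair should produce large VIFs):

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of a design matrix.

    For each column j, regress it on the remaining columns, take the R²
    of that auxiliary regression, and return 1 / (1 - R²).
    """
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # intercept + remaining predictors for the auxiliary regression
        A = np.column_stack([np.ones(len(y)), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ beta).var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)                   # independent of x1
x3 = x1 + rng.normal(scale=0.1, size=300)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
print(vif(X))   # x1 and x3 should show large VIFs, x2 should be near 1
```

In practice the `variance_inflation_factor` helper from statsmodels performs the same auxiliary-regression calculation.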

For our example white wine quality dataset, we can look at the VIF values for each predictor calculated using scaled data to identify potential multicollinearity. The table on the left below indicates high VIF values (> 10) for density and residual_sugar, which matches what we observed with the correlation matrix. Removing the predictor with the highest observed VIF and correlation (density) produces the table on the right below. Notice how the VIF values are now less than 5 for all predictors, indicating reduced multicollinearity following removal of this independent variable.

## 3. Condition Number

The **condition number** can be used as an indicator of higher-order multicollinearity that may not be detected using a correlation matrix or VIF values. Condition index values are determined using the eigenvalues calculated from a matrix of standardized predictors. Once all of the eigenvalues have been computed, **singular values** are determined by taking the square root of each eigenvalue. The **condition index** for each singular value is then calculated using the following equation:

CI_n = SV_Max / SV_n

In this equation, CI_n represents the condition index for the nth eigenvalue, SV_Max represents the maximum singular value across all n eigenvalues, and SV_n represents the singular value calculated for the nth eigenvalue. The largest condition index observed for a set of eigenvalues is referred to as the **condition number**. A condition number of 10–30 indicates the presence of multicollinearity within the dataset; condition numbers exceeding 30 indicate that the impact of multicollinearity is strong.
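The calculation above can be sketched directly from the singular value decomposition of the standardized predictor matrix (synthetic data for illustration):

```python
import numpy as np

def condition_indices(X):
    """Condition indices for a predictor matrix.

    Columns are standardized, then the singular values (square roots of
    the eigenvalues of Z'Z) are computed; each index is SV_max / SV_n,
    and the largest index is the condition number.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
    sv = np.linalg.svd(Z, compute_uv=False)    # singular values
    return sv.max() / sv

rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_ok = np.column_stack([a, b])                               # uncorrelated
X_bad = np.column_stack([a, b, a + 0.05 * rng.normal(size=200)])  # near-collinear

print(condition_indices(X_ok).max())    # close to 1: no multicollinearity
print(condition_indices(X_bad).max())   # large: the redundant column inflates it
```

Adding a single near-collinear column is enough to push the condition number well past the warning range, even though each pairwise correlation with `b` remains small.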

The condition number of a dataset is calculated automatically whenever you train an ordinary least squares regression model with the statsmodels Python library. Printing a summary of a trained model produces the output below (shown for the white wine dataset, before removal of multicollinearity):

Notice that the condition number (shown towards the bottom right of the output) is equal to 14.7, indicating the presence of moderate multicollinearity within the dataset. We can also see that the standard error for the “density” predictor coefficient is higher than any of the other predictors. Similar to the reduction in VIF observed when removing the predictor demonstrating the highest degree of multicollinearity from the dataset, the condition number decreases into an acceptable range after the “density” predictor is removed:

The biggest limitation of the condition number as an indicator of multicollinearity is that it does not estimate the magnitude of each predictor's contribution to the observed multicollinearity. However, reviewing the condition number alongside the coefficient standard errors can help identify potential multicollinearity that other methods miss. This is especially useful for large datasets with complex collinear relationships between predictors.

# Tools for Managing Multicollinearity

In real-world scenarios, a small amount of low-level multicollinearity is expected among predictor variables. However, high levels of multicollinearity should be managed to improve model explainability and to reduce the potential for variability among predictor coefficients whenever the model is retrained.

In some cases, multicollinearity may be a product of the data collection process. Depending on the problem being evaluated, the potential exists for collection of redundant data, especially in domains where multiple individuals or machines may be independently collecting, recording, and uploading data to a single location. Once multicollinearity has been identified in a dataset, measures should be taken to understand why it may be occurring and to see if it can be reduced or eliminated through modifications to the data collection pipeline.

Multicollinearity within a dataset can be managed with several different strategies. Here are a few:

- Remove independent variables that are highly correlated
- Use a dimensionality reduction technique, such as **principal component analysis** (PCA), to combine highly correlated features
- Consider using a regression algorithm, such as **ridge regression**, that is better suited to handling multicollinearity

## Remove High-VIF / Highly Correlated Predictors

Removing independent variables that demonstrate a high VIF or high correlation coefficient can increase the model’s stability by reducing the standard error of the predictors. The largest downside of this approach is that some information will be lost whenever predictors are discarded. This can be seen in the model output images for the white wine model included in the previous section. Removal of the “density” feature reduces predictor standard error and VIFs as well as the dataset condition number, but it also decreases the R² value and increases the AIC. However, negative impacts are mitigated if the removed predictors had a high degree of overlap, making removal a viable option for features with high redundancy.
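One way to automate this strategy is to drop the highest-VIF predictor repeatedly until every remaining VIF falls below the rule-of-thumb threshold. A sketch using synthetic data and a from-scratch VIF helper (the threshold of 5 matches the rule of thumb discussed earlier):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R²_j) from regressing column j on the rest."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ beta).var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the highest-VIF column until all VIFs < threshold."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        vifs = vif(X)
        worst = int(vifs.argmax())
        if vifs[worst] < threshold:
            break
        print(f"dropping {names[worst]} (VIF = {vifs[worst]:.1f})")
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + rng.normal(scale=0.1, size=300)   # redundant with x1
X, names = drop_high_vif(np.column_stack([x1, x2, x3]), ["x1", "x2", "x3"])
print(names)   # one of the redundant pair is gone; x2 survives
```

Only one of the redundant pair is removed; once it is gone, the remaining VIFs drop below the threshold and the loop stops, which limits the information lost.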

## Combine Correlated Predictors with PCA

Applying PCA or a similar dimensionality reduction technique to a dataset that demonstrates multicollinearity limits information loss and can produce a better fitting model than simply removing correlated features. However, combining variables in this manner makes it impossible to differentiate between their individual impacts. Once PCA is run and components have been selected to capture an acceptable level of variance, the effects of multicollinearity will be mitigated but the contributions of different predictors cannot be distinguished from one another since they have been modified by the PCA process.

When looking at the white wine dataset, we can first determine the number of principal components to keep by plotting percent of variance explained versus the number of components maintained:

Dropping just one component still explains nearly all of the variance. Training the model using nine components instead of the ten original features produces the following output:

The condition number has been reduced to 2.83 and the standard error for all predictors is low, indicating that multicollinearity has been managed effectively. However, we can now see that feature importance is tied to principal components instead of features with interpretable labels. Principal components act as “black boxes” containing pieces of information from multiple predictors. The explainability of the model is obscured using this approach.
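The PCA workflow described above can be sketched with plain numpy (synthetic collinear data rather than the wine dataset): standardize, decompose, keep the components that explain most of the variance, and confirm the retained scores are uncorrelated.

```python
import numpy as np

# Synthetic data with one near-collinear column
rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.1, size=n)    # collinear with x1
X = np.column_stack([x1, x2, x3])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

explained = S**2 / (S**2).sum()            # variance share per component
print(explained.round(3))

k = 2                                      # keep components covering most variance
scores = Z @ Vt[:k].T                      # projected features

# The retained components are orthogonal by construction, so their
# correlation matrix is numerically the identity: multicollinearity is gone.
print(np.corrcoef(scores, rowvar=False).round(3))
```

As the article notes, the cost of this fix is interpretability: the columns of `scores` are mixtures of the original predictors, not named features.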

## Ridge Regression

Ordinary least squares regression produces unbiased coefficient estimates. When multicollinearity is present, this lack of bias comes at the cost of high variance in the coefficient estimates. **Ridge regression** addresses this issue by introducing a small amount of bias into the estimates, trading it for a reduction in their variance. The math behind ridge regression is beyond the scope of this article, but this page provides more information.

Ridge regression applies **L2 regularization** during the training process, adding a penalty term to the regression loss function. In L2 regularization, this penalty term is equal to the sum of the squared coefficient values multiplied by a tuning factor (lambda). Lambda can take any non-negative value; the larger it is, the more weight is given to the penalty term.

A lambda value of 0 produces a regular ordinary least squares model. Increasing the value of lambda too much can over-penalize the coefficients, reducing the model’s fit and predictive power. However, tuning lambda to its optimal value can often help mitigate the impacts of multicollinearity while minimizing information loss, producing a well-fit model with reduced predictor coefficient standard error.
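For standardized predictors and a centered response, ridge regression has a closed-form solution, which makes the effect of lambda easy to see. A sketch with synthetic data (lambda = 0 reduces to OLS; larger lambda shrinks the coefficient norm):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: beta = (X'X + lam*I)^(-1) X'y.

    Assumes X is standardized and y is centered, so there is no
    intercept to exclude from the penalty.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)    # nearly collinear pair
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
y = x1 + rng.normal(scale=0.5, size=n)
y = y - y.mean()                           # center the response

beta_ols = ridge(X, y, 0.0)    # lam = 0: ordinary least squares
beta_l2 = ridge(X, y, 10.0)    # penalty shrinks and stabilizes coefficients
print(beta_ols, beta_l2)
```

In practice lambda is chosen by cross-validation (for example with scikit-learn's `RidgeCV`), balancing the variance reduction against the fit lost to shrinkage.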

# Thanks for Reading!

Code used for the examples included in this article can be found in this Jupyter notebook. Please give this article some claps if you found it helpful!