Choosing an evaluation metric to assess model performance is an important element of the data analysis pipeline. By properly selecting an evaluation metric, or equation used to objectively assess a model’s performance, we can get a good idea of how closely the results produced by our model match real-world observations. We can use the evaluation metric to determine if improvements should be made to model parameters or if another modeling method should be considered.
While choosing an evaluation metric seems like a simple task, there are a number of different metrics available. Each metric has strengths and weaknesses depending on the type of data being evaluated, the type of model being assessed, and the domain or field of interest in which the metric needs to be communicated.
Different types of modeling problems require different types of evaluation metrics. Classification and regression problems, for example, each present data in a completely different format, making the use of a single equation for both types of problem nearly impossible. As a result, each type of machine learning problem has its own unique set of applicable evaluation metrics.
The following sections provide a brief introduction to some of the most common evaluation metrics that can be used to assess the performance of regression models.
The phrase regression analysis refers to the characterization of relationships between a dependent and one or more independent variables using statistical methods or models. There are a number of different forms of regression analysis, but the simplest and most common is linear regression. Linear regression involves creation of a line using a linear equation with parameters that are modified to fit the data. An example of a simple linear regression line created based on a synthetic dataset is shown below:
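As a rough illustration of fitting a simple linear regression line, the sketch below implements the closed-form ordinary least squares solution in plain Python. The function name `fit_line` and the five data points are assumptions made up for this example; they are not the synthetic dataset used for the original figure.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical synthetic data: roughly y = 2x + 1 with small deviations.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(x, y)
predictions = [slope * xi + intercept for xi in x]
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

In practice a library such as scikit-learn would normally handle the fitting; the hand-written version is only meant to show which parameters are being adjusted.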
Other forms of regression work with nonlinear equations and incorporate additional statistical techniques to improve their ability to accurately model real-world data.
Regression analysis is a powerful tool that can be used to model continuous or time-series data. Regression models can be used to interpolate values within the boundaries of observed independent variable value ranges or to extrapolate / forecast values outside of the range of independent variable values used to develop the model.
In order for regression to be useful, we need to have a method that we can use to determine how well a regression model performs. Evaluation metrics are used for this purpose, providing a means to objectively assess the performance of a regression model by quantifying how well the model fits the data (its goodness-of-fit). This is accomplished by comparing actual, known values from a set of data with corresponding values predicted by a regression model and determining how strongly the two sets of data differ from one another.
Error, Residuals and Residual Plots
For regression problems, the difference between known values and the fitted regression line is referred to as error or residuals. Plotting the residuals produced by a model with respect to the independent variable used to generate the model can offer insights into whether or not the modeling method that was implemented was appropriate for the data. Residual plots are typically created to determine if the data can be modeled linearly or if it contains nonlinear relationships.
For a strong linear model that demonstrates a high goodness-of-fit, residuals should be distributed randomly. The plot below illustrates residuals for a linear model that performs well on a set of data with a strong linear relationship:
If a non-random pattern is identified in the residual plot, this may indicate that a nonlinear model may be more appropriate. The plot below illustrates residuals for a linear model developed for data that is defined by an underlying nonlinear relationship:
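The contrast between the two residual patterns can be sketched numerically without a plot. The example below fits a straight line to a roughly linear dataset and to a quadratic one, then lists the residuals (actual minus fitted) in order of the independent variable. The helper names and both datasets are hypothetical and chosen only for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def residuals(x, y):
    """Residuals (actual minus fitted) of a straight-line fit."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


x = [0.0, 1.0, 2.0, 3.0, 4.0]
linear_y = [1.1, 2.9, 5.2, 6.8, 9.0]   # roughly linear (made-up values)
curved_y = [xi ** 2 for xi in x]       # purely quadratic

linear_res = residuals(x, linear_y)    # small, signs alternate
curved_res = residuals(x, curved_y)    # U-shaped: positive, negative, positive

print([round(r, 2) for r in linear_res])
print([round(r, 2) for r in curved_res])
```

The residuals for the quadratic data are positive at both ends and negative in the middle, which is the kind of systematic pattern that suggests a straight line is the wrong model.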
Plotting residuals can only provide a limited amount of information regarding model performance. In order to objectively assess the performance of a regression model, we need to select an appropriate evaluation metric that takes our observed residuals as inputs and produces an interpretable value.
Common Regression Evaluation Metrics
Several different evaluation metrics have been developed by statisticians for evaluation of regression model performance. Each of the metrics described below quantifies the error associated with a model using a different method. The list below is not exhaustive, but it includes some of the most commonly encountered evaluation metrics when dealing with regression problems.
Mean Absolute Error (MAE)
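The Mean Absolute Error (MAE) refers to the mean of the absolute error values calculated for each point in the dataset. MAE is calculated using the following equation:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
Here $y_i$ is the actual value, $\hat{y}_i$ is the corresponding predicted value, and $n$ is the total number of observations.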
The absolute error is calculated for each pair of predicted and actual values by taking the absolute value of the difference between the two values. The absolute error terms calculated for each pair of values are then summed, and the resulting quantity is divided by the total number of observations. The resulting value represents the mean absolute error, or the average vertical distance between each pair of predicted and actual values when graphed.
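The calculation described above can be sketched in a few lines of plain Python. The function name and the sample values are assumptions made for this example.

```python
def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, in whatever units the target variable uses.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 12.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # absolute errors are 1, 0, 1, 3, so the mean is 1.25
```

scikit-learn provides an equivalent `mean_absolute_error` function in `sklearn.metrics` for real projects.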
Interpreting MAE is straightforward since it is expressed in the same units as our original data. A perfect model produces an MAE of zero, and the closer the observed MAE is to zero, the better the model fits the data. MAE treats all errors equally, regardless of whether the predicted value is smaller or larger than the actual value.
MAE penalizes each error in direct proportion to its size. The squared-error metrics we will look at later in this article (MSE and RMSE) apply disproportionately greater penalties to errors as they increase in size. As a result, MAE may be an appropriate metric to use when there is no need to penalize large errors more heavily, or for datasets that contain few or no outliers.
The value of MAE is often similar in magnitude to the value of root mean squared error (RMSE), another metric we will be looking at later in this article, although the gap between the two widens as the individual errors become more variable in size. MAE will never be higher than RMSE. MAE is also relatively easy to interpret at a human level since its calculation does not require squaring or taking roots. As a result, MAE may be appropriate for circumstances where error values need to be easily explained or presented to a non-technical audience.
Mean Bias Error (MBE)
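The Mean Bias Error (MBE) is calculated in a similar manner to MAE. An equation for calculating MBE is shown below, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations:
$$\mathrm{MBE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)$$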
When calculating MBE, the bias (the difference between the predicted and actual value) is calculated rather than the absolute error. Since the absolute value of the bias is not taken, both negative and positive values are possible. This differs from calculation of MAE, where only values greater than or equal to zero are possible.
MBE represents the average prediction bias produced by the model and does not measure the magnitude of the model’s error. Large negative bias values can offset large positive bias values, so a model can demonstrate a low MBE but a high MAE. The primary benefit of calculating MBE for a model is to identify systematic over- or under-prediction so that it can be addressed.
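The offsetting effect can be seen in a small sketch. The function names and values below are hypothetical, and the bias is taken as predicted minus actual to match the definition given earlier in this section.

```python
def mean_bias_error(actual, predicted):
    """Mean of the signed differences (predicted minus actual)."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up predictions that overshoot and undershoot by similar amounts.
actual = [10.0, 10.0, 10.0, 10.0]
predicted = [14.0, 6.0, 13.0, 7.0]

mbe = mean_bias_error(actual, predicted)      # +4, -4, +3, -3 cancel out
mae = mean_absolute_error(actual, predicted)  # 4, 4, 3, 3 do not cancel

print(mbe, mae)
```

Here the MBE is zero even though every prediction misses by at least 3 units, which is why MBE is best read alongside a magnitude-based metric such as MAE.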
Mean Squared Error (MSE)
The Mean Squared Error (MSE) refers to the mean of the squared error values calculated for each data point:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
This equation resembles the equations we have seen previously for MAE and MBE. Squaring each error term ensures that the MSE will be greater than or equal to zero. Since each error value is squared before the sum of all terms is taken, the impact of an error on the total grows quadratically with its size: an error twice as large contributes four times as much. The greater the observed error value, the larger the penalty that is applied when calculating MSE.
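A minimal sketch of the MSE calculation follows, using made-up values. It also prints each squared error so the outsized contribution of the largest error is visible.

```python
def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values; the last prediction is off by 3 units.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 12.0]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = mean_squared_error(actual, predicted)

print(squared_errors)  # [1.0, 0.0, 1.0, 9.0]
print(mse)             # 11 / 4 = 2.75
```

The single error of 3 accounts for 9 of the 11 units of total squared error, whereas it would account for 3 of 5 units of total absolute error.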
MSE is generally a more popular regression evaluation metric than MAE. The additional emphasis placed by MSE on large errors is often desirable, since we usually want a model that avoids large errors anywhere in the dataset rather than one that is accurate on most points but badly wrong on a few.
Like MAE, an MSE value closer to zero indicates better model performance. MSE is used less frequently for human interpretation since it is expressed in the squared units of the target variable, but it is very popular as a loss function in machine learning optimization.
MSE may also be calculated using data that was not included when training the model. Under these circumstances, it is known as Mean Squared Prediction Error (MSPE), and its calculation allows for model cross validation.
One advantage of MSE is that its expected value at a given input $x_0$ can be broken down into variance, squared bias and noise according to the following equation:
$$E\left[ \left( y_0 - \hat{f}(x_0) \right)^2 \right] = \mathrm{Var}\left( \hat{f}(x_0) \right) + \left[ \mathrm{Bias}\left( \hat{f}(x_0) \right) \right]^2 + \mathrm{Var}(\varepsilon)$$
The variance can be interpreted as the amount our predictions would change if they were developed using a different set of training data. The bias represents the error associated with incorrect assumptions made by our model. Noise represents random irregularities found in real-world data. This last quantity is the irreducible error that cannot be eliminated through creation of better models.
Much has been written about the tradeoff between bias and variance when developing statistical models. A model with high bias tends to underfit the data, while a model with high variance tends to overfit. Minimization of MSE has an underlying goal of minimizing both variance and bias, but decomposing MSE into its constituent quantities can offer insight into the balance between the two.
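The bias and variance terms can be illustrated with a toy calculation. The sketch below assumes we somehow know the true, noise-free value at one input and have predictions for that input from four models, each trained on a different training set. All numbers are invented for the example, and the noise term is left out because the predictions are compared against the noise-free value.

```python
from statistics import fmean, pvariance

# Hypothetical setup: the true (noise-free) value at a single input, and
# predictions for that input from models trained on different training sets.
true_value = 10.0
predictions = [9.0, 11.0, 12.0, 12.0]

mean_prediction = fmean(predictions)
bias_squared = (mean_prediction - true_value) ** 2   # systematic offset
variance = pvariance(predictions)                    # spread across training sets

# Average squared error of the predictions against the noise-free value.
mse_vs_truth = fmean([(p - true_value) ** 2 for p in predictions])

print(bias_squared, variance, mse_vs_truth)
```

The squared bias (1.0) and the variance (1.5) add up to the average squared error (2.5). With real data the true value is not known, so these quantities usually have to be estimated, for example through resampling.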
Root Mean Squared Error (RMSE)
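Root Mean Squared Error (RMSE) is one of the most popular evaluation metrics for regression problems. RMSE is calculated by taking the square root of the MSE:
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$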
The resulting value can be interpreted in the same units as the value we are trying to predict, which makes it easier to understand than some other metrics. However, it is important to remember that RMSE values can only be compared between models that measure error using the same units.
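A short sketch of the RMSE calculation is shown below, using the same kind of made-up values as before. It also computes MAE on the same data to illustrate the relationship noted earlier, that MAE never exceeds RMSE.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, in the units of the target variable.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 12.0]

rmse = root_mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)

print(round(rmse, 3), mae)
```

The RMSE comes out noticeably higher than the MAE here because one error (3 units) is much larger than the others.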
Like the other error metrics we have discussed, a lower value of RMSE indicates better model performance. The choice between RMSE and MAE matters most when the observed data has an asymmetric conditional distribution. Minimizing MAE as the objective function when training a machine learning model pulls the fit toward the conditional median, while minimizing MSE or RMSE pulls it toward the conditional mean. If the goal is to predict the mean, a model trained on MAE will therefore be biased, and RMSE should be used instead.
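The difference between the two objectives can be checked with a simple grid search. The sketch below looks for the single constant prediction that minimizes each metric on a small right-skewed dataset. The data, the candidate grid, and the helper names are assumptions chosen for illustration; a real model would predict a different value for each input.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed observations: most values are small, one is large.
data = [1.0, 2.0, 2.0, 3.0, 12.0]


def mae_for(constant):
    """MAE when every observation is predicted as the same constant."""
    return sum(abs(v - constant) for v in data) / len(data)


def rmse_for(constant):
    """RMSE when every observation is predicted as the same constant."""
    return math.sqrt(sum((v - constant) ** 2 for v in data) / len(data))


# Candidate constants from 0.0 to 12.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 121)]

best_for_mae = min(candidates, key=mae_for)
best_for_rmse = min(candidates, key=rmse_for)

print(best_for_mae, median(data))
print(best_for_rmse, mean(data))
```

On this dataset the MAE-optimal constant coincides with the median and the RMSE-optimal constant coincides with the mean, so the two objectives disagree by a wide margin whenever the distribution is skewed.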
RMSE can also be read as the standard deviation of the residuals produced by our model whenever the residuals have a mean of zero, which is the case for an ordinary least squares fit with an intercept evaluated on its training data. Under that condition, knowing the standard deviation of the residuals gives us an idea of how tightly our residuals are clustered around zero.
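The link between RMSE and the standard deviation of the residuals can be checked numerically for an ordinary least squares line with an intercept, evaluated on the data it was fitted to. The helper function and the data below are hypothetical.

```python
import math
from statistics import pstdev


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Made-up training data.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(x, y)
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

mean_residual = sum(residuals) / len(residuals)
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
residual_std = pstdev(residuals)  # population standard deviation

print(mean_residual, rmse, residual_std)
```

The mean residual is zero up to floating-point rounding, and the RMSE matches the population standard deviation of the residuals. On held-out data, or for models fitted without an intercept, the mean residual is generally not zero and the two quantities differ.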
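Both RMSE and MSE are preferable metrics to use when large errors are especially undesirable. Applying an additional penalty for larger differences between predicted and actual values pushes the model to accommodate extreme points during the fitting process. The same property makes both metrics sensitive to outliers: if the outliers in a dataset reflect measurement problems rather than real behavior, a handful of extreme points can dominate the metric, and MAE may be the more robust choice.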
R² / Coefficient of Determination
The R² value (also referred to as the coefficient of determination) quantifies how close the known data values are to the fitted regression line. Values of R² typically range from 0.0 to 1.0. A value of R² closer to 1.0 indicates a stronger model fit.
There are multiple methods that may be used to calculate the coefficient of determination. One common method for calculating this quantity first requires the calculation of two additional quantities: the residual sum of squares (RSS) and the total sum of squares (TSS). The RSS is calculated by simply summing up the squares of all residuals observed between the predicted and actual values:
$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
The TSS is calculated by taking the difference between each observed value and the mean of all observed values $\bar{y}$, squaring the difference, and summing the squared differences:
$$\mathrm{TSS} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2$$
The calculated RSS and TSS values can then be used to calculate R² using the following equation:
$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
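The RSS and TSS method described above can be sketched as follows. The function name and the sample values are assumptions made for this example.

```python
def r_squared(actual, predicted):
    """Coefficient of determination computed as 1 - RSS / TSS."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: squared gaps between actual and predicted values.
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared gaps between actual values and their mean.
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - rss / tss


# Hypothetical values: RSS = 1 + 0 + 1 + 9 = 11, TSS = 9 + 1 + 1 + 9 = 20.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 12.0]

r2 = r_squared(actual, predicted)
print(round(r2, 2))
```

Note that this sketch divides by TSS without checking for zero, so it assumes the actual values are not all identical. scikit-learn's `r2_score` in `sklearn.metrics` computes the same quantity.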
R² can be interpreted as the proportion of variance explained by the model. A value of R² = 1.0 indicates a perfect model fit, meaning that the model correctly predicts actual values with no error. In this circumstance, the model explains 100% of the variance of the data around its mean.
A value of R² = 0.0 indicates that the model performs no better than a horizontal line drawn at the mean of the observed values. In this circumstance, the model explains 0% of the variance of the data around its mean.
It is possible to achieve an R² value less than zero. This occurs when the selected model does not match the trend observed in the data. Trend lines that fit the data worse than a horizontal line at the mean of the observed values produce negative R² values.
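The three cases discussed above (a perfect fit, a fit no better than the mean, and a fit worse than the mean) can be reproduced with a small sketch. The data and the three sets of predictions are invented for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination computed as 1 - RSS / TSS."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - rss / tss


actual = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = [1.0, 2.0, 3.0, 4.0, 5.0]      # predictions match exactly
mean_only = [3.0, 3.0, 3.0, 3.0, 3.0]    # horizontal line at the mean
wrong_trend = [5.0, 4.0, 3.0, 2.0, 1.0]  # slopes the opposite way

r2_perfect = r_squared(actual, perfect)
r2_mean_only = r_squared(actual, mean_only)
r2_wrong_trend = r_squared(actual, wrong_trend)

print(r2_perfect, r2_mean_only, r2_wrong_trend)
```

The model with the reversed trend has a residual sum of squares four times the total sum of squares, so its R² is well below zero.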
Thanks for Reading!
This list of regression evaluation metrics is far from comprehensive, but it covers the most commonly encountered metrics used for standard data analysis.
- Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013. Accessed at: https://statlearning.com/
- Metrics and Scoring: Quantifying the Quality of Predictions (scikit-learn Documentation)