Machine Learning Evaluation Metrics
There are many metrics used to evaluate machine learning models, each with its own pros and cons. We can broadly group metrics into classification and regression metrics.
Classification metrics are used to evaluate the performance of machine learning models that predict discrete values. Before proceeding to explain some frequently used classification metrics, it is crucial to understand the confusion matrix, which tabulates true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN):
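As a minimal sketch, the confusion matrix for a binary problem can be computed with scikit-learn (assuming scikit-learn is installed; the labels below are illustrative):

```python
# Illustrative confusion matrix using scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (made-up example)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

# For binary labels, rows are actual classes and columns are predicted:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]
```

Here the model makes one false positive and one false negative; the metrics below are all derived from these four counts.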
Accuracy: This measures the proportion of correct predictions made by the model. It is easy to understand and interpret, but can be misleading in imbalanced datasets.
Precision: This measures the proportion of true positive predictions out of all positive predictions made by the model. It is useful for evaluating models that need to make few false positive predictions, such as in medical diagnosis.
Recall: This measures the proportion of true positive predictions out of all actual positive instances. It is useful for evaluating models that need to correctly identify as many positive instances as possible, such as in fraud detection.
F-score: This is the harmonic mean of precision and recall. The commonly used F1 score gives equal weight to both, providing a single number that balances the two.
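The four metrics above can be sketched with scikit-learn on a small hand-made example (the labels are illustrative, not from a real model):

```python
# Accuracy, precision, recall, and F1 on an illustrative example.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # one false positive, one false negative

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 6/8 = 0.75
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4 = 0.75
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4 = 0.75
f1 = f1_score(y_true, y_pred)           # harmonic mean of 0.75 and 0.75 = 0.75
```

Because precision and recall happen to be equal here, the F1 score equals both; in general it lies between them, closer to the smaller value.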
AUC-ROC Curve: This is a graphical representation of the performance of a binary classifier. It plots the true positive rate against the false positive rate at different classification thresholds. The area under the curve (AUC) summarizes the performance of the classifier and ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0, one that ranks at random has an AUC of about 0.5, and one whose predictions are 100% correct has an AUC of 1.0.
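A minimal sketch with scikit-learn, using illustrative probability scores rather than hard labels (AUC is computed from scores, not thresholded predictions):

```python
# ROC AUC from predicted scores, using scikit-learn.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]                # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8]     # predicted probabilities of class 1

# AUC equals the probability that a randomly chosen positive
# is scored higher than a randomly chosen negative.
# Here 3 of the 4 positive/negative pairs are ranked correctly -> 0.75.
auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.75
```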
Log loss: This is used for evaluating the probability predictions of a classifier. It is used for logistic regression, maximum entropy and other models that produce probability scores. The lower the log loss, the better the model is.
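A small sketch comparing two sets of probability predictions (illustrative numbers): the more confident and correct predictions receive a lower log loss.

```python
# Log loss rewards well-calibrated, confident probabilities.
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
confident = [0.1, 0.9, 0.8, 0.2]    # predicted P(class 1), mostly confident and right
uncertain = [0.4, 0.6, 0.55, 0.45]  # hedging near 0.5

loss_confident = log_loss(y_true, confident)
loss_uncertain = log_loss(y_true, uncertain)
# The confident-and-correct model has the lower (better) log loss.
```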
Regression metrics are used to evaluate the performance of machine learning models that predict continuous values. Some common regression metrics include:
Mean Absolute Error (MAE): This measures the average absolute difference between the predicted and actual values. It is easy to understand and interpret, but can be sensitive to outliers.
Median Absolute Error (MedAE): This measures the typical difference between the predicted and actual values, calculated as the median of the absolute errors. Unlike Mean Absolute Error (MAE), MedAE is not sensitive to outliers and therefore better reflects the typical error.
Max Error (ME): This measures the largest difference between the predicted and actual values, calculated as the maximum absolute error. It gives an idea of the worst-case scenario, which can be useful in applications where it is more important to minimize the maximum error than the average error.
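The contrast between the three absolute-error metrics shows up clearly when one prediction is an outlier; a sketch with scikit-learn on made-up values:

```python
# MAE, MedAE, and Max Error on an illustrative example with one outlier.
from sklearn.metrics import mean_absolute_error, median_absolute_error, max_error

y_true = [3.0, 5.0, 2.5, 7.0, 4.0]
y_pred = [2.5, 5.0, 3.0, 8.0, 14.0]  # last prediction is badly off

# absolute errors: 0.5, 0.0, 0.5, 1.0, 10.0
mae = mean_absolute_error(y_true, y_pred)      # 2.4 -- pulled up by the outlier
medae = median_absolute_error(y_true, y_pred)  # 0.5 -- robust to the outlier
me = max_error(y_true, y_pred)                 # 10.0 -- the worst-case error
```

A single bad prediction inflates MAE fivefold relative to MedAE, while Max Error isolates it entirely.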
Mean Squared Error (MSE): This measures the average of the squared differences between the predicted and actual values. It is commonly used and easy to interpret, but also sensitive to outliers.
Root Mean Squared Error (RMSE): This is the square root of the mean squared error. Like MSE, it measures the difference between the predicted and actual values, but it is expressed in the same unit as the target variable, which makes it easier to interpret.
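A minimal sketch of MSE and RMSE with scikit-learn (illustrative values):

```python
# MSE and its square root, RMSE, which is in the units of the target.
import math
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

# squared errors: 0.25, 0.0, 0.25, 1.0 -> mean = 0.375
mse = mean_squared_error(y_true, y_pred)
rmse = math.sqrt(mse)  # about 0.61, in the same units as y_true
```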
R-squared (R²): This is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is at most 1, where a higher value indicates a better fit; it can even be negative when the model fits worse than simply predicting the mean.
Adjusted R-squared: This is R-squared adjusted for the number of independent variables in the model; in other words, it penalizes adding variables that do not improve the fit.
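R² is available in scikit-learn; adjusted R² is not, but it follows directly from its definition. A sketch on illustrative values, where the number of predictors `p = 2` is an assumption made up for the example:

```python
# R-squared via scikit-learn, adjusted R-squared from its definition.
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 2.5, 7.0, 4.0, 6.0]
y_pred = [2.8, 5.1, 3.0, 6.5, 4.2, 5.9]

r2 = r2_score(y_true, y_pred)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# with n samples and p predictors (p = 2 here is illustrative).
n, p = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
# adj_r2 <= r2 whenever r2 < 1: the penalty grows with p.
```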
Mean Absolute Percentage Error (MAPE): This measures the average absolute percentage difference between the predicted and actual values. It is easy to interpret, but can be sensitive to outliers, and it is not defined when the actual value is zero.
Weighted Mean Absolute Percentage Error (WMAPE): This is a variation of Mean Absolute Percentage Error (MAPE) where the absolute percentage error is weighted by the actual value. This metric is particularly useful when the range of the target variable is large and the relative error can be very different for different observations.
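MAPE is available in scikit-learn (as `mean_absolute_percentage_error`, expressed as a fraction rather than a percentage); WMAPE is not, but it can be computed from its definition: total absolute error divided by total actual value. A sketch on illustrative values:

```python
# MAPE via scikit-learn; WMAPE computed from its definition.
from sklearn.metrics import mean_absolute_percentage_error

y_true = [100.0, 200.0, 50.0, 400.0]
y_pred = [110.0, 190.0, 60.0, 380.0]

# per-point percentage errors: 0.10, 0.05, 0.20, 0.05 -> mean = 0.10
mape = mean_absolute_percentage_error(y_true, y_pred)

# WMAPE weights errors by the actual values:
# sum of absolute errors (50) / sum of actuals (750) ~= 0.067
wmape = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / sum(y_true)
```

Note how WMAPE is lower than MAPE here: the largest percentage error occurs on the smallest actual value (50), so weighting by the actuals dilutes its influence.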
These are some of the most commonly used regression metrics, but it is important to consider the specific problem and the goals of the model when selecting a metric.
Updated on: 27/01/2023