Error Metrics | Towards Data Science
https://towardsdatascience.com/tag/error-metrics/

How to Measure Real Model Accuracy When Labels Are Noisy
https://towardsdatascience.com/how-to-measure-real-model-accuracy-when-labels-are-noisy/
The math behind “true” accuracy and error correlation

Ground truth is never perfect. From scientific measurements to human annotations used to train deep learning models, ground truth always contains some amount of error. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in its human annotations. How, then, can we evaluate predictive models using such imperfect labels?

In this article, we explore how to account for errors in test data labels and estimate a model’s “true” accuracy.

Example: image classification

Let’s say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the “true” accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:

  1. Within the 90% of predictions that the model got “right,” some examples may have been incorrectly labeled, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
  2. Conversely, within the 10% of “incorrect” predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.

Given these complications, how much can the true accuracy vary?

Range of true accuracy

True accuracy of model for perfectly correlated and perfectly uncorrelated errors of model and label. Figure by author.

The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model’s errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as human labelers), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 − (1 − 0.96) = 86%

Alternatively, if our model is wrong in exactly the opposite way as human labelers (perfect negative correlation), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 + (1 − 0.96) = 94%

Or more generally:

Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

It’s important to note that the model’s true accuracy can be both lower and higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.
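As a quick sanity check, here is a minimal Python sketch of these bounds (the helper name is mine, not from the article):

def true_accuracy_bounds(a_model, a_groundtruth):
    # Lower bound: the model is wrong in exactly the same way as the labelers.
    # Upper bound: the model is actually right on every mislabeled example.
    label_error = 1 - a_groundtruth
    return a_model - label_error, a_model + label_error

print(true_accuracy_bounds(0.90, 0.96))  # approximately (0.86, 0.94)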

Probabilistic estimate of true accuracy

In some cases, inaccuracies among labels are randomly spread among the examples and not systematically biased toward certain labels or regions of the feature space. If the model’s inaccuracies are independent of the inaccuracies in the labels, we can derive a more precise estimate of its true accuracy.

When we measure Aᵐᵒᵈᵉˡ (90%), we’re counting cases where the model’s prediction matches the ground truth label. This can happen in two scenarios:

  1. Both model and ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
  2. Both model and ground truth are wrong (in the same way). This happens with probability (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).

Under independence, we can express this as:

Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 − Aᵗʳᵘᵉ) × (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

Rearranging the terms, we get:

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ − 1)

In our example, that equals (0.90 + 0.96 − 1) / (2 × 0.96 − 1) = 93.5%, which is within the range of 86% to 94% that we derived above.
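Here is a minimal Python sketch of this estimate (the helper name is mine):

def true_accuracy_independent(a_model, a_groundtruth):
    # Rearranged from: A_model = A_true * A_gt + (1 - A_true) * (1 - A_gt)
    return (a_model + a_groundtruth - 1) / (2 * a_groundtruth - 1)

print(true_accuracy_independent(0.90, 0.96))  # ~0.935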

The independence paradox

Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ as 0.96 from our example, we get

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ − 0.04) / 0.92. Let’s plot this below.

True accuracy as a function of model’s reported accuracy when ground truth accuracy = 96%. Figure by author.
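A small matplotlib sketch (assuming Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ = 0.96, as above) that reproduces a curve like the one in this figure:

import numpy as np
import matplotlib.pyplot as plt

a_model = np.linspace(0.5, 1.0, 100)   # reported accuracy
a_true = (a_model - 0.04) / 0.92       # independence assumption, A_groundtruth = 0.96

plt.plot(a_model, a_true, label="estimated true accuracy")
plt.plot(a_model, a_model, linestyle="--", label="1:1 line")
plt.xlabel("Reported accuracy (A_model)")
plt.ylabel("Estimated true accuracy (A_true)")
plt.legend()
plt.show()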

Strange, isn’t it? If we assume that the model’s errors are uncorrelated with the ground truth errors, then its true accuracy Aᵗʳᵘᵉ always lies above the 1:1 line whenever the reported accuracy is above 0.5. This holds even if we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:

Model’s “true” accuracy as a function of its reported accuracy and ground truth accuracy. Figure by author.

Error correlation: why models often struggle where humans do

The independence assumption is crucial but often doesn’t hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then the ground truth and model errors are likely to be correlated. This pushes Aᵗʳᵘᵉ closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than to the upper bound.

More generally, model errors tend to be correlated with ground truth errors when:

  1. Both humans and models struggle with the same “difficult” examples (e.g., ambiguous images, edge cases)
  2. The model has learned the same biases present in the human labeling process
  3. Certain classes or examples are inherently ambiguous or challenging for any classifier, human or machine
  4. The labels themselves are generated from another model
  5. There are too many classes (and thus too many different ways of being wrong)

Best practices

The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.

When evaluating model performance with imperfect ground truth:

  1. Conduct targeted error analysis: Examine examples where the model disagrees with ground truth to identify potential ground truth errors.
  2. Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ − (1 − Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
  3. Obtain multiple independent annotations: Having multiple annotators can help estimate ground truth accuracy more reliably.

Conclusion

In summary, we learned that:

  1. The range of possible true accuracy depends on the error rate in the ground truth
  2. When errors are independent, the true accuracy is often higher than measured for models better than random chance
  3. In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound

Mean Average Precision at K (MAP@K) clearly explained
https://towardsdatascience.com/mean-average-precision-at-k-map-k-clearly-explained-538d8e032d2/
One of the most popular evaluation metrics for recommender or ranking problems, explained step by step

Mean Average Precision at K (MAP@K) is one of the most commonly used evaluation metrics for recommender systems and other ranking-related classification tasks. Since this metric is a composition of several error metrics or layers, it may not be that easy to understand at first glance.

This article explains MAP@K and its components step by step. At the end of this article you will also find code snippets showing how to calculate the metric. But before diving into each part of the metric, let’s talk about the WHY first.

WHY use MAP@K?

MAP@K is an error metric that can be used when the sequence or ranking of your recommended items plays an important role or is the objective of your task. By using it, you get answers to the following questions:

  • Are my generated or predicted recommendations relevant?
  • Are the most relevant recommendations on the first ranks?

Making the following steps easier to understand

Now that you know the WHY, let’s talk about the HOW. The following chapters explain MAP@K’s structure step by step in an "onion" style, from the inside (starting with Precision P) to the outside (MAP@K).

To make the steps and their composition easier to understand, we work with the following example: We want to evaluate our recommender system, which recommends six items to potential customers when visiting a product detail page (figure 1).

Figure 1. Recommendation example (image by author).

Precision (P)

You might have already heard of precision in books or articles when you learned about error metrics for classification models. Precision can be seen as a measure of quality. High precision means that our model returns more relevant than irrelevant results or recommendations.

Precision can be defined as the fraction of relevant items in all recommended items (relevant + irrelevant items).

Figure 2. Precision formula (image by author).

The example below (figure 3) shows 6 recommended items. Out of these 6 recommendations, 2 are relevant.

Figure 3. Precision example (image by author).

By putting these values in our formula (figure 2), we get a precision of 0.33 (2 relevant items / (2 relevant + 4 irrelevant items)).

Precision@K (P@K)

The precision metric (figure 2) itself does not consider the rank or order in which the relevant items appear. Time to include the ranks in our precision formula. Precision@K can be defined as the fraction of relevant items among the top K recommended items (figure 4).

Figure 4. Precision@K formula (image by author).

The following figure (figure 5) shows our example from above (figure 3) in a ranked scenario.

Figure 5. Precision@K example (image by author).

The Precision@K column shows for each rank (1 to 6) the Precision@K value. The K stands for the number of ranks (1, 2, …, 6) we consider.

Precision@1

Assuming we would consider only the first rank (K=1), we would then have 0 relevant items divided by 1 (total items), which leads to 0 for Precision@1.

Precision@3

Let’s assume we consider the first three ranks (K=3). Among the top 3 recommendations we then have 1 relevant item and 2 irrelevant ones. If we place these numbers in our formula (figure 4), we get 0.33 (1 relevant item / (1 relevant + 2 irrelevant items)).

Precision@5

Last but not least, let’s consider the first five ranks (K=5). Among the first 5 recommendations we then have 2 relevant and 3 irrelevant ones. If we do the math again, we get a value of 0.4 (2 relevant items / (2 relevant + 3 irrelevant items)).
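To make the arithmetic concrete, here is a tiny Python sketch (not from the article) that reproduces these values, assuming the relevant items sit at ranks 2 and 4 as in figure 5:

relevance = [0, 1, 0, 1, 0, 0]  # 1 = relevant, 0 = irrelevant, ordered by rank

def precision_at_k(relevance, k):
    # Fraction of relevant items among the top K recommendations
    return sum(relevance[:k]) / k

print(precision_at_k(relevance, 1))  # 0.0
print(precision_at_k(relevance, 3))  # 0.33...
print(precision_at_k(relevance, 5))  # 0.4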

Average Precision@K (AP@K)

As we have seen, Precision and Precision@K are pretty straightforward. The next step has a bit more complexity.

The Average Precision@K (AP@K) is the sum of Precision@k over all ranks k at which the item is relevant (rel(k)), divided by the total number of relevant items (r) in the top K recommendations (figure 6).

Figure 6. AP@K formula (image by author).

Confused? Let’s have a look at the following example for AP@6 (figure 7).

Figure 7. AP@K example 1 (image by author).

In this example, the total number of relevant items (r) is 2 (at ranks 2 and 4). Therefore, we can place the 2 in the fraction 1/r.

Let’s look at rank 1. The Precision@1 is 0 and the item is not relevant (grey). Therefore, we multiply its Precision@1 value by 0, which leads to 0 * 0.

On the 2nd rank, however, we have a Precision@2 value of 0.5 and a relevant item. This leads to 0.5 * 1.

On the 3rd rank we again have an irrelevant item and a Precision@3 of 0.33. This results in 0.33 * 0.

We proceed this way over each rank. If rank k contains a relevant item, we multiply its Precision@k by 1. If it is irrelevant, we multiply it by 0, which means it does not contribute to the sum.

The end result for this example is an AP@6 of 0.5.

Before we move on to the last step, do you remember when we said at the beginning that MAP@K can answer the question:

Are the most relevant recommendations on the first ranks?

A great characteristic of this metric is that it penalizes relevant items placed at lower ranks. To give you a better understanding, let’s look at the following example (figure 8).

Figure 8. Example for relevant items on different ranks. Best case (left), worst case (right). The higher the better (image by author).

Compared to the initial example (figure 7), the number of relevant items has not changed. What has changed is the rank at which they are placed. AP@K (and therefore MAP@K) penalizes your recommendations or model if the relevant items are placed at lower ranks.

Mean Average Precision@K (MAP@K)

The previous steps and examples were all based on evaluating a single query: the single list of recommendations one visitor gets when browsing the product detail page of product X. But we have more than one visitor…

Mean Average Precision@K or MAP@K considers that. It averages the AP@K for recommendations shown to M users.

Figure 9. MAP@K formula (image by author).

Please note: for the sake of simplicity I chose "users" in this example. Depending on your case, M could also be, e.g., search queries.

To get a better idea of how MAP@K is calculated, have a look at the following example (figure 10).

Figure 10. Example of MAP@K (image by author).

Based on 3 different users (or queries), the MAP@6 is 0.59.

Coding

Now that we are familiar with the theory, let’s do some coding. In the following examples we will work with two lists, actuals and predicted, that contain product_ids.

The goal is to check whether our actual values appear in our predicted list and, if yes, at which rank/place they appear. That’s why the order of the items matters in our predicted list but not in our actuals.

AP@K

The following lists reflect the example we used earlier (figure 11).

Figure 11. Recap example from the start (image by author).
actuals = ['p_a', 'p_b']
predicted = ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f']

If we compare the actuals with the predicted list, we can see that p_a shows up at the 2nd rank and p_b at the 4th one.

def apk(y_true, y_pred, k_max=0):

  # Check if all elements in lists are unique
  if len(set(y_true)) != len(y_true):
    raise ValueError("Values in y_true are not unique")

  if len(set(y_pred)) != len(y_pred):
    raise ValueError("Values in y_pred are not unique")

  if k_max != 0:
    y_pred = y_pred[:k_max]

  correct_predictions = 0
  running_sum = 0

  for i, yp_item in enumerate(y_pred):

    k = i+1 # our rank starts at 1

    if yp_item in y_true:
      correct_predictions += 1
      running_sum += correct_predictions/k

  return running_sum/len(y_true)

If we place our two lists in our function

apk(actuals, predicted)

then we get 0.5 like in our manual example (figure 7).

MAP@K

Since MAP@K averages over multiple queries, we adjust our lists to the following structure:

actuals = ['p_a', 'p_b']

predicted = [
    ['p_a', 'p_b', 'p_c', 'p_d', 'p_e', 'p_f'],
    ['p_c', 'p_d', 'p_e', 'p_f', 'p_a', 'p_b'],
    ['p_d', 'p_a', 'p_c', 'p_b', 'p_e', 'p_f'],
]

Our actuals stay the same, but our predicted list now contains several lists (one for each query). The predicted lists correspond to the ones from figure 10.

Figure 12. Recap example from figure 10 (image by author).

The code below shows the calculation of MAP@K:

import numpy as np

def mapk(actuals, predicted, k=0):
  # Average the AP@K over all queries (one predicted list per query)
  return np.mean([apk(actuals, p, k) for p in predicted])

If we place our lists in this function

mapk(actuals, predicted)

then we get 0.59.

Conclusion

When evaluating recommender systems or ranking models, MAP@K is a great choice. It not only provides insight into whether your recommendations are relevant but also considers the rank of your correct predictions.

Due to its ranking considerations, it is a bit more challenging to understand than other standard error metrics like the F1 score. But I hope that this article gave you a comprehensive explanation of how this error metric is calculated and implemented.

If you want to measure not only if your recommendations are relevant but also how relevant they are, feel free to check out Normalized Discounted Cumulative Gain (NDCG).

Sources

Introduction to Information Retrieval – Evaluation, https://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-1-per.pdf

Time Series Forecast Error Metrics you should know
https://towardsdatascience.com/time-series-forecast-error-metrics-you-should-know-cc88b8c67f27/
An overview and introduction to the most common error metrics

Hands-on Tutorials

Time Series Forecast Error Metrics You Should Know


Photo by Ksenia Chernaya on Pexels.

Using the right error metrics in your data science project is crucial. A wrong error metric will not only affect your model’s optimization (loss function) but might also skew your judgment of models.

Besides the classical error metrics like Mean Absolute Error, more and more new error metrics are being developed and published regularly.

The idea of this article is not only to provide you with an overview of the most used ones but also to show how they are calculated, along with their advantages and disadvantages.

Before we start, please keep in mind that there is no silver bullet, no single best error metric. The fundamental challenge is that every statistical measure condenses a large amount of data into a single value, so it only provides one projection of the model errors, emphasizing a certain aspect of the error characteristics of the model performance (Chai and Draxler, 2014).

Therefore it is better to take a practical and pragmatic view and work with a selection of metrics that fit your use case or project.

To identify the most used or common error metrics, I screened over 12 time series forecasting frameworks and libraries (e.g., kats, sktime, darts) and checked which error metrics they offer. Out of these I identified the top 8 most common forecasting error metrics and grouped them into the four categories (see figure 1) proposed by Hyndman and Koehler (2006).

Figure 1. Overview Time Series Forecast Error Metrics (image by author).

Scale Dependent Metrics

Many popular metrics are referred to as scale-dependent (Hyndman, 2006). Scale-dependent means the error metrics are expressed in the units (e.g., dollars, inches) of the underlying data.

The main advantage of scale-dependent metrics is that they are usually easy to calculate and interpret. However, they cannot be used to compare different series because of their scale dependency (Hyndman, 2006).

Please note that Hyndman (2006) includes the Mean Squared Error in the scale-dependent group (claiming that the error is "on the same scale as the data"). However, the Mean Squared Error has the dimension of the squared scale/unit. To bring MSE back to the data’s unit we need to take the square root, which leads to another metric, the RMSE (Shcherbakov et al., 2013).

Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) is calculated by taking the mean of the absolute differences between the actual values (also called y) and the predicted values (y_hat).

Simple, isn’t it? And that’s its major advantage. It is easy to understand (even for business users) and to compute. It is recommended for assessing accuracy on a single series (Hyndman, 2006). However, if you want to compare different series (with different units), it is not suitable. You should also not use it if you want to penalize outliers.

Mean Squared Error (MSE)

If you want to put more attention on outliers (huge errors), you can consider the Mean Squared Error (MSE). As its name implies, it takes the mean of the squared errors (differences between y and y_hat). Due to the squaring, it weights large errors much more heavily than small ones, which can be a disadvantage in some situations. Therefore the MSE is suitable for situations where you really want to focus on large errors. Also keep in mind that, due to the squaring, the metric loses its unit.

Root Mean Squared Error (RMSE)

To avoid the MSE’s loss of its unit we can take the square root of it. The outcome is then a new error metric called the Root Mean Squared Error (RMSE).

It comes with the same advantages as its siblings MAE and MSE. However, like MSE, it is also sensitive to outliers.

Some authors like Willmott and Matsuura (2005) argue that the RMSE is an inappropriate and misinterpreted measure of an average error and recommend MAE over RMSE.

However, Chai and Draxler (2014) partially refuted their arguments and recommend RMSE over MAE for model optimization as well as for evaluating different models where the error distribution is expected to be Gaussian.
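For reference, a minimal NumPy sketch of the three scale-dependent metrics (function names are mine):

import numpy as np

def mae(y, y_hat):
    # Mean of the absolute errors, in the units of the data
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    # Mean of the squared errors, in squared units
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    # Square root of the MSE, back in the units of the data
    return np.sqrt(mse(y, y_hat))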


Percentage Error Metrics

As we know from the previous chapter, scale dependent metrics are not suitable for comparing different time series.

Percentage error metrics solve this problem. They are scale-independent and can be used to compare forecast performance between different time series. However, their weak spot is zero values in a time series: then they become infinite or undefined, which makes them uninterpretable (Hyndman, 2006).

Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error (MAPE) is one of the most popular error metrics in time series forecasting. It is calculated by taking the average (mean) of the absolute differences between actuals and predicted values divided by the actuals.

Please note that some MAPE formulas do not multiply the result(s) by 100. However, the MAPE is presented as a percentage, so I added the multiplication.

MAPE’s advantages are its scale-independence and easy interpretability. As said at the beginning, percentage error metrics can be used to compare the outcome of multiple time series models with different scales.

However, MAPE also comes with some disadvantages. First, it generates infinite or undefined values for zero or close-to-zero actual values (Kim and Kim 2016).

Second, it also puts a heavier penalty on negative errors than on positive ones, which leads to an asymmetry (Hyndman, 2014).

And last but not least, MAPE cannot be used when using percentages makes no sense. This is, for example, the case when measuring temperatures: the Fahrenheit and Celsius scales have relatively arbitrary zero points, and it makes no sense to talk about percentages (Hyndman and Koehler, 2006).
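A minimal sketch of MAPE in percent (illustrative; it assumes no actual value is zero):

import numpy as np

def mape(y, y_hat):
    # Undefined or infinite if any actual value y is zero
    return np.mean(np.abs((y - y_hat) / y)) * 100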

Symmetric Mean Absolute Percentage Error (sMAPE)

To avoid the asymmetry of the MAPE, a new error metric was proposed: the Symmetric Mean Absolute Percentage Error (sMAPE). The sMAPE is probably one of the most controversial error metrics, since not only do different definitions and formulas exist, but critics also claim that this metric is not symmetric as the name suggests (Goodwin and Lawton, 1999).

The original idea of an "adjusted MAPE" was proposed by Armstrong (1985). However, by his definition the error metric can be negative or infinite, since the values in the denominator are not taken as absolute values (which is then correctly mentioned as a disadvantage in some articles that follow his definition).

Makridakis (1993) proposed a similar metric and called it SMAPE. His formula, which can be seen below, avoids the problems of Armstrong’s formula by taking the absolute values in the denominator (Hyndman, 2014).

Note: Makridakis (1993) proposed the formula above in his paper "Accuracy measures: theoretical and practical concerns". Later, in the publication "The M3-Competition: results, conclusions and implications" (Makridakis and Hibon, 2000), he used Armstrong’s formula (Hyndman, 2014). This fact has probably also contributed to the confusion about SMAPE’s different definitions.

The sMAPE is the average across all forecasts made for a given horizon. Its advantages are that it avoids MAPE’s problem of large errors when y-values are close to zero, as well as the large difference between the absolute percentage errors when y is greater than y_hat and vice versa. Unlike MAPE, which has no upper limit, it fluctuates between 0% and 200% (Makridakis and Hibon, 2000).

For the sake of interpretation there is also a slightly modified version of SMAPE that ensures that the metric’s results will always be between 0% and 100%:

The following code snippet contains the sMAPE metric proposed by Makridakis (1993) and the modified version.
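A minimal NumPy sketch of both variants (illustrative; function names are mine):

import numpy as np

def smape(y, y_hat):
    # Makridakis (1993) variant: bounded between 0% and 200%
    return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) * 100

def smape_modified(y, y_hat):
    # Modified variant: bounded between 0% and 100%
    return np.mean(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) * 100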

As mentioned at the beginning, there are controversies around the sMAPE, and they are justified. Goodwin and Lawton (1999) pointed out that sMAPE penalizes under-estimates more than over-estimates (Chen et al., 2017). Cánovas (2009) proves this with an easy example.

Table 1. Example with a symmetric sMAPE.
Table 2. Example with an asymmetric sMAPE.

Starting with table 1, we have two cases. In case 1 our actual value y is 100 and the prediction y_hat is 150. This leads to a sMAPE value of 20%. Case 2 is the opposite: here we have an actual value y of 150 and a prediction y_hat of 100. This also leads to a sMAPE of 20%. So far, symmetry seems to be given…

Let us now have a look at table 2. We also have two cases here, and as you can already see, the sMAPE values are not the same anymore. The second case leads to a different sMAPE value of 33%.

Modifying the forecast while holding the actual value and the absolute deviation fixed does not produce the same sMAPE value. Simply biasing the model without improving its accuracy should never produce different error values (Cánovas, 2009).


Relative Error Metrics

Compared to the error metrics explained before, relative error metrics compare your model’s performance (so its errors) with the performance of a baseline or benchmark model.

The most common benchmark models are naive, snaive and the mean of all observations.

In a naive or random walk model the prediction is just equal to the previous observation.

If you have seasonal data, it is useful to choose the snaive method. The snaive method sets each forecast to be equal to the last observed value from the same season of the year (e.g., the same month of the previous year). It is defined as follows:
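Following Hyndman and Athanasopoulos (2018), the seasonal naive forecast can be written as

$\hat{y}_{T+h|T} = y_{T+h-m(k+1)}$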

where m is the seasonal period and k is the integer part of (h−1)/m (i.e., the number of complete years in the forecast period prior to time T+h). For monthly data this means that the forecast for all future October values is equal to the last observed October value (Hyndman and Athanasopoulos, 2018).

Due to their scale-independence, these metrics were recommended in studies by Armstrong and Collopy (1992) and by Fildes (1992) for assessing forecast accuracy across multiple series. However, when the calculated errors are small, the use of the naive method as a benchmark is no longer possible because it would lead to division by zero (Hyndman, 2006).

Median Relative Absolute Error (MdRAE)

where bᵢ is the benchmark forecast and M the seasonal period of our time series.

As mentioned in the introduction to this section, relative error metrics compare our model’s performance (forecast) with that of a benchmark method (e.g., a random walk). The Median Relative Absolute Error (MdRAE) calculates the median of the ratio of our forecast’s absolute error to the absolute error of a benchmark model.

If our model’s forecast equals the benchmark’s forecast, the result is 1. If the benchmark’s forecast is better than ours, the result will be above 1. If ours is better, it will be below 1.

Since we are calculating the median, the MdRAE is more robust to outliers than other error metrics. However, the MdRAE has issues with division by zero. To avoid this difficulty, Armstrong and Collopy (1992) recommended that extreme values be trimmed; however, this increases both the complexity and the arbitrariness of the calculation, as the amount of trimming must be specified (Kim and Kim, 2016).

Compared to the error metrics before, the relative error metrics are a bit more complex to calculate and interpret. Let’s have an example to strengthen our understanding.

Table 3. MdRAE calculation example (image by author).

Table 3 shows our actual values y, the predictions of our model y_hat and the forecasts of a naive benchmark model y_bnchmrk, which uses the last point of our training data set as its prediction. Of course there are also other options for calculating the benchmark’s predictions (e.g., including seasonality or drift, or just taking the mean of the training data).

The MdRAE then takes the median of the ratio of our forecast’s absolute error (y − y_hat) to the benchmark model’s absolute error (y − y_bnchmrk).

The result is 0.15, which is clearly smaller than 1, so our forecast is better than the one from the benchmark model.
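A minimal NumPy sketch of the MdRAE calculation (function name is mine; it assumes the benchmark errors are non-zero):

import numpy as np

def mdrae(y, y_hat, y_bnchmrk):
    # Median of the ratio of our absolute errors to the benchmark's absolute errors
    return np.median(np.abs(y - y_hat) / np.abs(y - y_bnchmrk))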

Geometric Mean Relative Absolute Error (GMRAE)

where bᵢ is the benchmark forecast and M the seasonal period of our time series.

Like the MdRAE, the Geometric Mean Relative Absolute Error (GMRAE) compares the errors of our forecast with those of a defined baseline model. However, instead of calculating the median, the GMRAE, as its name implies, calculates the geometric mean of the relative errors.

A GMRAE above 1 states that the benchmark is better; a result below 1 indicates that our model’s forecast performs better.

Taking an arithmetic mean of log-scaled error ratios (see the alternative representation) makes the GMRAE more resistant to outliers. However, the GMRAE is still sensitive to them: it can be dominated not only by a single large outlier but also by an extremely small error close to zero, because there is neither an upper nor a lower bound for the log-scaled error ratios (Chen and Twycross, 2017). If the error of the benchmark method is zero, a very large value is returned.
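A minimal sketch of the GMRAE via log-scaled error ratios (function name is mine; it assumes no error is exactly zero):

import numpy as np

def gmrae(y, y_hat, y_bnchmrk):
    # Geometric mean of the relative absolute errors, computed on the log scale
    ratios = np.abs(y - y_hat) / np.abs(y - y_bnchmrk)
    return np.exp(np.mean(np.log(ratios)))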


Scale-Free Error Metrics

Relative measures try to remove the scale of the data by comparing the forecasts with those obtained from some benchmark (naive) method. However, they have problems. Relative errors have a statistical distribution with undefined mean and infinite variance. They can only be computed when there are several forecasts on the same series, and so cannot be used to measure out-of-sample forecast accuracy at a single forecast horizon (Hyndman and Koehler, 2006).

To solve this issue, Hyndman and Koehler (2006) proposed a new kind of metric: the scale-free error metric. Their idea was to scale the error based on the in-sample MAE of a naive (random walk) forecast method.

Mean Absolute Scaled Error (MASE)

The MASE is calculated by taking the MAE and dividing it by the MAE of an in-sample (i.e., based on our training data) naive benchmark.

Values of MASE greater than 1 indicate that the forecasts are worse, on average, than in-sample one-step forecasts from the naive model (Hyndman and Koehler, 2006).

Since it is a scale-free metric, one is able to compare a model’s accuracy across time series with different scales. Unlike the relative error metrics, it does not produce undefined or infinite values, which makes it a suitable metric for time series data with zeros. The only cases in which the MASE would be infinite or undefined are when all historical observations are equal or when all of the actual values during the in-sample period are zero (Kim and Kim, 2016).

However, there are also some critical voices. Davydenko and Fildes (2013) argue that MASE introduces a bias towards overrating the performance of a benchmark forecast as a result of arithmetic averaging, and that MASE is vulnerable to outliers as a result of dividing by small benchmark MAE values. Also, because the MAE in the denominator is computed on in-sample data, the metric can be trickier to explain to business users than other (simpler) metrics.
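A minimal sketch of the MASE with a one-step naive benchmark computed on the training data (function name is mine):

import numpy as np

def mase(y_train, y_test, y_hat):
    # In-sample MAE of the one-step naive forecast (previous observation)
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return np.mean(np.abs(y_test - y_hat)) / naive_mae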


C̶o̶n̶f̶u̶s̶i̶o̶n̶ Conclusion

As you have seen, there is no silver bullet, no single best error metric. Each category or metric has its advantages and weaknesses, so it always depends on your individual use case or purpose and your underlying data. It is important not to look at just one single error metric when evaluating your model’s performance. The literature offers the following recommendations.

If all series are on the same scale, the data preprocessing procedures (data cleaning, anomaly detection) have been performed, and the task is to evaluate forecast performance, then the MAE can be preferred because it is simpler to explain (Hyndman and Koehler, 2006; Shcherbakov et al., 2013).

Chai and Draxler (2014) recommend preferring RMSE over MAE when the error distribution is expected to be Gaussian.

In case the data contain outliers, it is advisable to apply scaled measures like MASE. In this situation the horizon should be large enough, the series should not consist of identical values, and the normalization factor should not be equal to zero (Shcherbakov et al., 2013).

The error metrics introduced here may be the common ones, but that does not imply they are the best for your use case. Also, as I mentioned, new error metrics like the average relative MAE (AvgRelMAE) or the Unscaled Mean Bounded Relative Absolute Error (UMBRAE) are being developed and published frequently. So it is definitely worth having a look at these metrics, what they are trying to improve (e.g., becoming more robust or symmetric), and whether they might be suitable for your project.


Bibliography

Armstrong, J. Scott, and Fred Collopy. 1992. "Error Measures for Generalizing about Forecasting Methods: Empirical Comparisons." International Journal of Forecasting 8(1). doi: 10.1016/0169-2070(92)90008-W.

Chai, T., and R. R. Draxler. 2014. "Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? – Arguments against Avoiding RMSE in the Literature." Geoscientific Model Development 7(3). doi: 10.5194/gmd-7-1247-2014.

Chen, Chao, Jamie Twycross, and Jonathan M. Garibaldi. 2017. "A New Accuracy Measure Based on Bounded Relative Error for Time Series Forecasting." PLOS ONE 12(3). doi: 10.1371/journal.pone.0174202.

Goodwin, Paul, and Richard Lawton. 1999. "On the Asymmetry of the Symmetric MAPE." International Journal of Forecasting 15(4):405–8. doi: https://doi.org/10.1016/S0169-2070(99)00007-2.

Hyndman, Rob. 2006. "Another Look at Forecast Accuracy Metrics for Intermittent Demand." Foresight: The International Journal of Applied Forecasting 4:43–46.

Hyndman, Rob J., and Anne B. Koehler. 2006. "Another Look at Measures of Forecast Accuracy." International Journal of Forecasting 22(4). doi: 10.1016/j.ijforecast.2006.03.001.

Hyndman, Robin John, and George Athanasopoulos. 2018. Forecasting: Principles and Practice. 2nd ed. OTexts.

Kim, Sungil, and Heeyoung Kim. 2016. "A New Metric of Absolute Percentage Error for Intermittent Demand Forecasts." International Journal of Forecasting 32(3):669–79. doi: https://doi.org/10.1016/j.ijforecast.2015.12.003.

Makridakis, Spyros, and Michèle Hibon. 2000. "The M3-Competition: Results, Conclusions and Implications." International Journal of Forecasting 16(4):451–76. doi: https://doi.org/10.1016/S0169-2070(00)00057-1.

Shcherbakov, Maxim V., Adriaan Brebels, Anton Tyukov, Timur Janovsky, and Valeriy Anatol. 2013. "A Survey of Forecast Error Measures."

Willmott, C. J., and K. Matsuura. 2005. "Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance." Climate Research 30. doi: 10.3354/cr030079.
