(Next article for consumers of data science analysis to better understand the utility of the results. Previous one on classifiers is here)
I was in a meeting recently where a talented data scientist was showing his analysis on a problem predicting delay in a mobile network. There were lots cool graphs in a Jupyter notebook, and I asked him how well the algorithm performed. He said, “The RMSE is 0.5678, normalized.” On further discussion he indicated that the Root Mean Square Error (RMSE) was lower for this algorithm than other ones he tried (which is good – all things considered). But what I really wanted to know was how useful was his algorithm at predicting delay. What I had in my mind was a manager level answer, like, “We can predict delay plus or minus 0.5 seconds, 95% of the time”. We never really made it to that level of communication, because the only information he had was RMSE and he did not understand how to give me the information I wanted.
In the service of increasing the effective communications between data scientists and users of their analysis, I thought I would see what we could do with RMSE to understand utility of a regression algorithm.
RMSE is a measure of fit of an algorithm to the data available to make the algorithm – it is calculated based on the difference between the actual data and the algorithm generated value – called residuals. In a comparison of RMSEs from two different algorithms, the algorithm with the smaller value should have a better fit with the data because the difference in the residuals is smaller. There are pathological reasons (see) why in some cases a smaller RMSE does not mean a better algorithm, but in general it is a guiding rule of thumb for comparing algorithms.
Given the RMSE is available from most algorithms the question is “can we make some sort of statement around the margin of error to understand how useful is the algorithm?”. Margin of error as a concept is well known to many people. Consider the example below where I have a label (speed) that has a range of data that goes from 10 to 20 kph and an algorithm “A” to predict the label with an accuracy of +/- 10 kph 95% of the time.
I would argue that this is not a very useful algorithm, based on the margin of error analysis. For example, if the algorithm predicts a certain value is 14 kph I know there is a +/- 10 kph error around that prediction. The actual value be anything from 4 kph to 24 kph, and given that my data only has a range that goes from 10 to 20 … basically any value in the data set range. Sort of like me guessing a number between 10 and 20.
That said, I am sorry to disappoint but you cannot use RMSE to build a margin of error, since you need to know the probability distribution of the residuals, and this is not readily knowable or predictable. Though this is the correct answer, it is not a very useful answer since I still do not know anything meaninful about how useful is the algorithm.
However, a slightly less correct and potentially more useful answer is that *if* the residuals are randomly distributed around 0 (meaning, most of the predictions are pretty good and the good and bad prediction are “evenly” distributed), you can start to make some opinions on the prediction range. Consider the picture below of a set of random residuals from a hypothetical algorithm run.
This is a simulation of residuals that is based on a uniform random distribution. You can see the calculate +/- RMSE value as a red line, and 2 times the +/- RMSE value as the green line. Based on several simulations, I can tell you that 100% of the values in a uniform random distribution are within 2 x RMSE. So *if* you think your residuals follow a uniform random distribution, well then *all* of your prediction results will be within 2 x RMSE.
The next picture is from a uniform normal distribution
And here the RMSE acts just like the standard deviation – 2 times the RMSE limit covers 95.45% of data. Even in a slightly contrived pathological case (growing error based on uniform random distribution) 2 times the RMSE will cover 93.82% of the data.
To summarize, you should not use RMSE to make a statement around margin of error because you don’t really know the distribution of the residuals. But if you do proceed down this path l it should work fairly well, meaning that ~95% of the residuals will be within 2 x RMSE. And if you have a talented data scientist reporting the RMSE, you can ask her what percentage of the residuals fit into a 2 times RMSE bound. You can even ask her to use the Kolmogorov Smirnov Test (see article from a colleague) to do some validation of the distribution of the residuals compared to a normal or uniform random distribution. I am sure they will be happy to help with this precise request.