“Forty-two!” yelled Loonquawl.
“Is that all you’ve got to show for seven and a half million years’ work?”
“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
Douglas Adams, The Hitchhiker’s Guide to the Galaxy
I have been to a few presentations over the past weeks where “accuracy” and “RMSE” were quoted with little insight into the actual utility of an algorithm. In one meeting, two teams went at it, arguing over their approaches to a problem and comparing accuracy results, while I noticed that neither approach actually did anything useful. Or consider the vendor that sold a friend of mine a defect detection solution that was 95% accurate at detecting faults, when the faults only occur 1 in 1000 times in their sampling. It was 95% accurate, and utterly useless to my friend’s company.
So, in the service of improving communication between data scientists and the consumers of their analysis, I thought I would discuss the utility of an algorithm in terms of measures that are not always brought up in presentations or reviews.
Accuracy is rarely useful
When it comes to classification problems, it may be true that there are data sets in the wild that are nicely balanced, but I never seem to run into them. In a 2-class classification problem, I normally see 1-5% of the data in one class and the rest in the other class (e.g. churning versus non-churning subscribers of a telecommunications network). With unbalanced data sets like these, accuracy (i.e. number of records classified correctly / total number of records) is rarely useful. Here is an example as to why.
Consider a classifier that predicts whether a mobile network subscriber will churn or not, applied to a known data set of 1000 subscribers, 1% of whom are churners. I have included the definitions and values of some different classifier measures.
In the example above, the accuracy is 98.5%, which sounds good as a headline. But let’s dig a little deeper.
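As a rough sketch of how such measures fall out of the raw counts, here is a small Python snippet. The four confusion-matrix counts are not stated directly in the text. They are an assumption, inferred from the figures quoted for this example (1000 subscribers, 1% churners, 98.5% accuracy, 50% recall, 33.3% precision), and they are the only set of counts consistent with all of those figures.

```python
# Confusion-matrix counts for the 1000-subscriber churn example.
# Assumption: these counts are inferred from the percentages quoted in the
# text (1% churners, 98.5% accuracy, 50% recall, 33.3% precision).
TP = 5    # churners correctly flagged as churners
FN = 5    # churners the classifier missed
FP = 10   # non-churners wrongly flagged as churners
TN = 980  # non-churners correctly left alone

total = TP + FN + FP + TN

accuracy = (TP + TN) / total   # share of all records classified correctly
recall = TP / (TP + FN)        # share of real churners that were found
precision = TP / (TP + FP)     # share of flagged subscribers who really churn

print(f"accuracy:  {accuracy:.1%}")   # 98.5%
print(f"recall:    {recall:.1%}")     # 50.0%
print(f"precision: {precision:.1%}")  # 33.3%
```

Note that the 980 true negatives dominate the accuracy figure, which is why it stays high no matter how the classifier does on the 10 real churners.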
From a non-technical perspective, I would offer the following interpretations of these other measures:
- “TP Rate” or “Sensitivity” or “Recall”: This is the fraction of actual positive cases that were found by the algorithm. In this example it is 50%, meaning that half the real churners were found and the other half were missed.
- “Precision”: This is the fraction of predicted positive cases that are actual positive cases. In this example it is 33.3%, meaning that there is a 1/3 chance that a subscriber the algorithm predicts to be a churner really is a churner.
The high accuracy value in the example seems a little less relevant in the light of these other measures. While the accuracy is high, the algorithm does not seem to be good at finding churners (“recall”) and, potentially worse, 2/3 of the churners it predicts are not churners (“precision”). This could be a significant problem if a company takes expensive actions to retain subscribers, since only 1/3 of that spending will be focused on actual churners.
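One way to put a headline accuracy number in context is to compare it with a “classifier” that does no work at all and always predicts the majority class. The sketch below is only an illustration: the helper name `majority_baseline_accuracy` is made up for this post, and the two rates are the ones quoted earlier (1% churners, faults occurring 1 in 1000 times).

```python
def majority_baseline_accuracy(positive_rate: float) -> float:
    """Accuracy of a do-nothing model that always predicts the majority class.

    Hypothetical helper for illustration; positive_rate is the share of
    records in the positive (rare) class.
    """
    return max(positive_rate, 1.0 - positive_rate)


churn_baseline = majority_baseline_accuracy(0.01)   # 1% churners
fault_baseline = majority_baseline_accuracy(0.001)  # faults occur 1 in 1000 times

print(f"always 'no churn': {churn_baseline:.1%} accurate, finds no churners")
print(f"always 'no fault': {fault_baseline:.1%} accurate, finds no faults")
```

Both baselines beat the headline figures (98.5% and 95%) on accuracy while finding nothing, which is a quick sanity check to ask for whenever accuracy is the only number on the slide.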
The good news here is that algorithms can be tuned to focus on different measures, to provide insights more in line with the client’s circumstances. Maybe the rate of successful predictions is what matters, and false positives or false negatives are less of an issue. Or maybe too many false positives are a bigger issue than missing a few true positives. This is something that should be discussed between the client and the data scientist, to ensure that the best algorithm for the client is developed – not just the most accurate one.
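One common tuning knob, assuming the classifier produces a churn score rather than a hard yes/no, is the cutoff above which a subscriber is flagged. The sketch below uses ten invented (score, outcome) pairs purely for illustration. Neither the data nor the `precision_recall` helper comes from a real model or library.

```python
# Hypothetical (churn_score, actually_churned) pairs, invented for illustration.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.20, False),
    (0.10, False), (0.05, False),
]


def precision_recall(scored, threshold):
    """Flag every subscriber whose score is at or above the threshold."""
    tp = sum(1 for score, churned in scored if score >= threshold and churned)
    fp = sum(1 for score, churned in scored if score >= threshold and not churned)
    fn = sum(1 for score, churned in scored if score < threshold and churned)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall(scored, threshold)
    print(f"threshold {threshold:.2f}: precision {p:.0%}, recall {r:.0%}")
```

On this toy data, a strict cutoff flags only sure things (high precision, but half the churners are missed), while a loose cutoff catches every churner at the cost of more wasted retention offers. Which end of that trade-off is right depends on what a false positive and a missed churner each cost the client.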