Cambridge Analytica whistleblower: ‘We spent $1m harvesting millions of Facebook profiles’ – YouTube
— Read on m.youtube.com/watch
Short post. If you listen to what the nice data scientist is saying in the video, you will get a good idea of how the process for data analysis in the top end, and the great responsibility the data scientists have to not do evil.
(Next article for consumers of data science analysis to better understand the utility of the results. Previous one on classifiers is here)
I was in a meeting recently where a talented data scientist was showing his analysis on a problem predicting delay in a mobile network. There were lots cool graphs in a Jupyter notebook, and I asked him how well the algorithm performed. He said, “The RMSE is 0.5678, normalized.” On further discussion he indicated that the Root Mean Square Error (RMSE) was lower for this algorithm than other ones he tried (which is good – all things considered). But what I really wanted to know was how useful was his algorithm at predicting delay. What I had in my mind was a manager level answer, like, “We can predict delay plus or minus 0.5 seconds, 95% of the time”. We never really made it to that level of communication, because the only information he had was RMSE and he did not understand how to give me the information I wanted.
In the service of increasing the effective communications between data scientists and users of their analysis, I thought I would see what we could do with RMSE to understand utility of a regression algorithm.
RMSE is a measure of fit of an algorithm to the data available to make the algorithm – it is calculated based on the difference between the actual data and the algorithm generated value – called residuals. In a comparison of RMSEs from two different algorithms, the algorithm with the smaller value should have a better fit with the data because the difference in the residuals is smaller. There are pathological reasons (see) why in some cases a smaller RMSE does not mean a better algorithm, but in general it is a guiding rule of thumb for comparing algorithms.
Given the RMSE is available from most algorithms the question is “can we make some sort of statement around the margin of error to understand how useful is the algorithm?”. Margin of error as a concept is well known to many people. Consider the example below where I have a label (speed) that has a range of data that goes from 10 to 20 kph and an algorithm “A” to predict the label with an accuracy of +/- 10 kph 95% of the time.
I would argue that this is not a very useful algorithm, based on the margin of error analysis. For example, if the algorithm predicts a certain value is 14 kph I know there is a +/- 10 kph error around that prediction. The actual value be anything from 4 kph to 24 kph, and given that my data only has a range that goes from 10 to 20 … basically any value in the data set range. Sort of like me guessing a number between 10 and 20.
That said, I am sorry to disappoint but you cannot use RMSE to build a margin of error, since you need to know the probability distribution of the residuals, and this is not readily knowable or predictable. Though this is the correct answer, it is not a very useful answer since I still do not know anything meaninful about how useful is the algorithm.
However, a slightly less correct and potentially more useful answer is that *if* the residuals are randomly distributed around 0 (meaning, most of the predictions are pretty good and the good and bad prediction are “evenly” distributed), you can start to make some opinions on the prediction range. Consider the picture below of a set of random residuals from a hypothetical algorithm run.
This is a simulation of residuals that is based on a uniform random distribution. You can see the calculate +/- RMSE value as a red line, and 2 times the +/- RMSE value as the green line. Based on several simulations, I can tell you that 100% of the values in a uniform random distribution are within 2 x RMSE. So *if* you think your residuals follow a uniform random distribution, well then *all* of your prediction results will be within 2 x RMSE.
The next picture is from a uniform normal distribution
And here the RMSE acts just like the standard deviation – 2 times the RMSE limit covers 95.45% of data. Even in a slightly contrived pathological case (growing error based on uniform random distribution) 2 times the RMSE will cover 93.82% of the data.
To summarize, you should not use RMSE to make a statement around margin of error because you don’t really know the distribution of the residuals. But if you do proceed down this path l it should work fairly well, meaning that ~95% of the residuals will be within 2 x RMSE. And if you have a talented data scientist reporting the RMSE, you can ask her what percentage of the residuals fit into a 2 times RMSE bound. You can even ask her to use the Kolmogorov Smirnov Test (see article from a colleague) to do some validation of the distribution of the residuals compared to a normal or uniform random distribution. I am sure they will be happy to help with this precise request.
“Is that all you’ve got to show for seven and a half million years’ work?”
“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
Douglas Adams, The Hitchhiker’s Guide to the Galaxy
I have been to a few presentations over the past weeks where “accuracy” and “RMSE” have been used, with little insight into the utility of an algorithm. In one case there was a meeting were two teams went at it, arguing on their approach to a problem and comparing accuracy results, while I noticed that neither approach actual did anything useful. Or consider the case of the vendor that sold a friend of mine a defect detection solution that was 95% accurate to detect faults, but the faults only occur 1 in 1000 times from their sampling. It was 95% accurate, and utterly useless to my friend’s company.
So, in the service of increasing the effective communications between data scientists and consumers of their analysis, I thought I would discuss utility of an algorithm in terms of measures that are not always brought up in presentation or reviews.
Accuracy is rarely useful
When it comes to classification problems, it may be true there are data sets in the wild that are nicely balanced, but I never seem to run into them. In a 2-class classification problem, I normally see 1-5% of the data in one class and the rest in the other class (e.g. churners versus non-churners subscribers from a telecommunication network). In the case of unbalanced datasets, accuracy (I.e. number classified correctly / total number of records) is rarely useful. Here is an example as to why.
Consider a classifier that predicts if a mobile network subscriber will churn or not, with a known set of data 1000 subscribers that have 1% churners. I have included the definitions and values of some different classifier measures
In the example above, the accuracy is 98.5% which sounds good as a headline. But let’s dig a little deeper
From a non-technical perspective, I would offer the following interpretations of these other measures:
“TP Rate” or “Sensitivity” or “Recall“: This is how many actual positive cases were found by the algorithm. In this example it is 50%, meaning that half the real churners were found and the other half were missed.
“Precision“: This is how many actual positive cases were found from all the positive cases predicted. In this example it is 33.3%, meaning that there is a 1/3 chance that what the algorithm predicts is a churner is really a churner.
The high accuracy value in the example seems a little less relevant in the light of these other measures. While the accuracy is high, the algorithm does not seem to actually be good at finding churners (“recall”) and potentially worse 2/3 of the churners it predicts are not churners (“precision”). This could be a significant problem if a company takes expensive actions to retain subscribers, since only 1/3 of the amounts will be focused on actual churners.
The good news here is that algorithms can be tuned to focus on different measures, to provide insights more in line with the client’s circumstances. Maybe the rate of successful predictions is important, and false positives or negatives less of an issue. Or too many false positives are a bigger issue than missing a few true positives. This is something that should be discussed between the client and the data scientist, to ensure that the best algorithm for the client is developed – not just the most accurate one.
In the previous blog post (link) I talked about how to interpret places from journey records, proposing a visualization based means to identify key places (“home”, “work”) and then to do a validation using common mapping services like Google. In this post I am going to explore analytic insights on the place information to address questions of the data related to likelihood of journeys and their habitual nature.
Preparing the data
The data set I used for the place frequency analysis until now was basically a list of journeys with start and stop places, with time and date information. A snippet is shown below:
Next I want to understand how often during the sample period the car in question went from one place to another. The data set contains actual journeys, but what is missing is the days and times there were no journeys. For example, if a car went from place 1 to place 2 on Monday 3 times in the data set, I don’t immediately know how habitual that is that is unless I know how many “Mondays” were present in the sample time-frame.
What I do know at this point is that depending on the car, I have about 1.5 – 2 months of data — between 6 and 8 weeks — which when considering insights into journey habits maybe not be a lot. My instincts at this point are to first consider coarse level temporal patterns – like daily events – before moving to more granular analysis – like time related hourly or time of day patterns.
I processed the actual journeys to expand the non-travel dates and also sum the number of journeys between places on each day of the week. You can see a snippet of the file below, focusing on Monday and Friday journeys for a particular car:
What you can see is that (on row 2) on “Mondays” there were a total of 13 journeys from place 3 to place 3 that occurred over 4 days (journey.days) and 2 days (non.journey.days) when there was no travel. If you add the jouney.days and non.journey.days features together you can figure out that there was a total of 6 “Mondays” in my sample set (i.e. 6 weeks of data). And you can also see (on row 64) that there were no journeys on Mondays between place 1 to place 1.
After expanded data set of journeys was built, I created 2 3-dimensional matrices:
Total number of trips per day from one place to another (journey.count.x)
Number of days in the sample at least 1 trip occurred per day from one place to another (pred.journey)
The matrix structure for each is:
(journey start place — “from”Xjourney end place — “to”Xday of the week)
(total number of placesX total number of placeX7)
I have included the code snippets at the end of the blog to see the necessary transformations to make these matrices from the data frame above.
So now the fun can begin! Let’s consider some questions related to temporal journey patterns
Where does the car usually go?
This is really two questions hiding inside one:
What journeys does the car take on a regular basis?
What journeys does the car NOT take on a regular basis?
Both of these questions related to habitual patterns in the data, in contrast to non-habitual patterns (i.e. unpredictable patterns). Both babitual and non-habitual behaviour are interesting to know, since this will allow us to form quantitative statements on likelihood of journeys on certain days.
Take an example: Assume a particular car has gone from place 3 to place 6 five dates out of six on Tuesdays. This seems to a “usual” journey, but how usual is it?
This question can be formulated as a probabilistic hypothesis that we can use the sample data to resolve. Given we have trial information (5 out of 6 occurrences), this sounds like an application of binomial probability, but in this case we don’t know the probability (p) of the traveling from location 3 to 6. However, what we can do is calculate the probability of getting 5/6 occurrences with different p values and examine the likelihood that certain p values are “unlikely”. In this case I have rather arbitrarily set “unlikely” as less than 10%, meaning I want to have a 90% level of likeliness that my p value is correct.
To be a bit more precise, I want to find the value of p (pi) such that I can reject the following (null) hypothesis at 90% level.
Journeys between place 3 and place 5 on day Tuesdays are follow a binomial distribution with p value pi
Below are the probability values for different of p and different number of journeys out of a maximum of 6.
We can see that for 5 journey days out of 6 (“5T”), the p value (“p”) of 0.50 has a probability of 0.094, and a p value of 0.55 has a probability of 0.136. If the value of pi in the above hypothesis was 0.50, I could reject the hypothesis since there is only a 9.4% chance I can have 5/6 journeys with that p value. In fact, I can reject the hypothesis for all value less than 0.50 as well, because their probabilities are also lower than 10%. What I am left with is the statement that journeys between place 3 and 6 on Tuesdays is “usual”, where the p value is greater than 0.55, or in other words there is a >55% chance that a car will go from place 3 to place 6 on Tuesdays.
What about trips not taken? In our data set, there are lots of non-journeys on days (i.e. days of the week with no journeys). For any location pair on any day with no journeys (0 out of 6 – column “0T”), we can see that p value of 0.35 or less cannot be rejected at the 90% level of likeliness. So, in our case “unusual” means a p value of less than 0.35.
One take away from this analysis is that a sample of 6 weeks does not give us a lot information to make really significant statements on the data. Even with no observed travel between 2 places in a day we can only be confident that the probability of travel between them is less than 35%. If we had 10 weeks of information (see below for a table I worked out with 10 weeks journey info), then the binomial probability of taking no trips over 10 weeks would be less than 0.25 at a 90% confidence level. Similarly, if there was a journey 9 out of 10 weeks we could side the binomial probability would be greater than 0.8 at a 90% confidence level.
A bit unsatisfuing, but having some data is better than having no data, and being able to quantify our intuition about “usual” into a probabilistic statement is something.
In terms of scripting, R provides some simple ways to extract the relevant car information from our journey count matrix once they have been developed — arguably better than the original data frame.
To find all the days and journeys that have 4, 5 or 6 journeys, we can use the syntax:
which(pred.journey.x > 3, arr.in=TRUE)
The result is show below, where “dim1” is the row, “dim2” is the column and “dim3” is the day from the original matrix.
For example, we can see that on Mondays (“dim3” = 1) the car had more than 3 journey days out of 6 between places 3 to 3, 10 to 3 and 3 to 10. Similarly, on Tuesdays there were 4 or more journey days between 3 to 6 and 6 to 7.
You can play with the conditional statement in the “which” command to select exact journey values or other conditional queries you can imagine.
You can also dig a little deeper into the journey days for specific locations or days. Here is a command to show all the journey days per day of the week that occurred between location 3 and 6.
Are there places that the car goes to frequently?
On examining the data, I could see that for most day and journey pairs, there is 1 or 0 journeys. However, there are a few places and days where there are more frequent journeys. Below is a report (code at bottom to generate it) that shows the journey locations and days for a specific car (“dim1”, “dim2”, “dim3”) that had more journeys per day, with journeys count (“journey.count”) and days traveled (“journey.days”) and the ratio of journeys to days (“ratio.v”)
What you can observe is that on Mondays (“dim3” = 1) there are “some regular trips” (with trip probability of between 0.20 and 0.85 based on the fact 4 /6 weeks there was travel) from 3 to 3 (ratio 3.25) and 10 to 3 (ratio 1.25).
While this has been an interesting exercise, the less than satisfying part for me is the lack of profound (even relatively profound) statements that can be made on the data, given the relatively little amount of information available. However, in the land of the blind the one-eyed man is king, so some information is better than none, and the ability to put some precision to our intuition of patterns I think is valuable.
If I had more data, I would have liked to expand the analysis to cover time as well, both hourly or time of day (e.g. morning, afternoon, evening). The structure of the analysis matrices would be the same, except I would have to add another dimension for time or time of day (thus making a 4-dimensional matrix). Based on the hypothesis testing I have done with the days, I would think much more than 10 weeks data should be sufficient to start considering this level of detail.
Happy location playing!
Creating the location and frequency matrices
# Section 1. Load the data from journey file
# this section takes in the dataframes (reads them in first from CSV files)
# select a particlar car and day to examine
# > str(df.x)
#'data.frame': 700 obs. of 6 variables:
# $ from : int 1 1 1 1 1 1 1 1 1 1 …
#$ to : int 1 1 1 1 1 1 1 2 2 2 …
#$ day : chr "Monday" "Tuesday" "Wednesday" "Thursday" …
#$ journeys : int 0 0 0 0 0 0 0 0 1 1 …
#$ journey.days : int 0 0 0 0 0 0 0 0 1 1 …
#$ non.journey.days: int 6 6 6 6 6 6 6 6 5 5 …
#from to day journeys journey.days non.journey.days