Cambridge Analytica whistleblower: ‘We spent $1m harvesting millions of Facebook profiles’ – YouTube
— Read on m.youtube.com/watch
Short post. If you listen to what the nice data scientist is saying in the video, you will get a good idea of how the data analysis process works at the top end, and of the great responsibility data scientists have to not do evil.
(Next article for consumers of data science analysis to better understand the utility of the results. Previous one on classifiers is here)
I was in a meeting recently where a talented data scientist was showing his analysis on a problem predicting delay in a mobile network. There were lots of cool graphs in a Jupyter notebook, and I asked him how well the algorithm performed. He said, “The RMSE is 0.5678, normalized.” On further discussion he indicated that the Root Mean Square Error (RMSE) was lower for this algorithm than for the other ones he tried (which is good, all things considered). But what I really wanted to know was how useful his algorithm was at predicting delay. What I had in mind was a manager-level answer, like, “We can predict delay plus or minus 0.5 seconds, 95% of the time”. We never really made it to that level of communication, because the only information he had was RMSE and he did not understand how to give me the information I wanted.
In the service of increasing the effective communications between data scientists and users of their analysis, I thought I would see what we could do with RMSE to understand the utility of a regression algorithm.
RMSE is a measure of how well an algorithm fits the data used to build it. It is calculated from the differences between the actual data and the values the algorithm generates, called residuals: square each residual, take the mean, then take the square root. In a comparison of RMSEs from two different algorithms, the algorithm with the smaller value should fit the data better because its residuals are smaller overall. There are pathological reasons (see) why in some cases a smaller RMSE does not mean a better algorithm, but in general it is a guiding rule of thumb for comparing algorithms.
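As a minimal sketch of that calculation in R, with made-up delay values (the numbers and variable names are assumptions for illustration, not data from the meeting):

```r
# Hypothetical actual vs. predicted delays in seconds; values are made up.
actual    <- c(1.2, 0.8, 1.5, 2.0, 1.1)
predicted <- c(1.0, 1.0, 1.4, 1.7, 1.3)

res  <- actual - predicted     # residuals: actual minus algorithm-generated value
rmse <- sqrt(mean(res^2))      # square, average, then take the square root

print(rmse)
```

The result is in the same units as the label (seconds here), which is what makes it tempting to read as an error margin.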
Given the RMSE is available from most algorithms the question is “can we make some sort of statement around the margin of error to understand how useful is the algorithm?”. Margin of error as a concept is well known to many people. Consider the example below where I have a label (speed) that has a range of data that goes from 10 to 20 kph and an algorithm “A” to predict the label with an accuracy of +/- 10 kph 95% of the time.
I would argue that this is not a very useful algorithm, based on the margin of error analysis. For example, if the algorithm predicts a certain value is 14 kph I know there is a +/- 10 kph error around that prediction. The actual value could be anything from 4 kph to 24 kph, and given that my data only has a range that goes from 10 to 20 … basically any value in the data set range. Sort of like me guessing a number between 10 and 20.
That said, I am sorry to disappoint but you cannot use RMSE to build a margin of error, since you need to know the probability distribution of the residuals, and this is not readily knowable or predictable. Though this is the correct answer, it is not a very useful answer since I still do not know anything meaningful about how useful the algorithm is.
However, a slightly less correct and potentially more useful answer is that *if* the residuals are randomly distributed around 0 (meaning, most of the predictions are pretty good and the good and bad predictions are “evenly” distributed), you can start to form some opinions on the prediction range. Consider the picture below of a set of random residuals from a hypothetical algorithm run.
This is a simulation of residuals that is based on a uniform random distribution. You can see the calculated +/- RMSE value as a red line, and 2 times the +/- RMSE value as the green line. Based on several simulations, I can tell you that 100% of the values in a uniform random distribution are within 2 x RMSE. So *if* you think your residuals follow a uniform random distribution, well then *all* of your prediction results will be within 2 x RMSE.
The next picture is from a normal distribution.
And here the RMSE acts just like the standard deviation – 2 times the RMSE limit covers 95.45% of data. Even in a slightly contrived pathological case (growing error based on uniform random distribution) 2 times the RMSE will cover 93.82% of the data.
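The two simulations can be approximated with a short R sketch. The sample size, seed and distribution parameters below are assumptions chosen for illustration; they are not the settings behind the pictures.

```r
set.seed(42)      # arbitrary seed so the run is repeatable
n <- 100000       # assumed sample size

# Share of residuals that fall inside +/- k times the RMSE
coverage <- function(res, k = 2) {
  rmse <- sqrt(mean(res^2))
  mean(abs(res) <= k * rmse)
}

res.unif <- runif(n, min = -1, max = 1)   # uniform residuals centred on 0
res.norm <- rnorm(n, mean = 0, sd = 1)    # normal residuals centred on 0

cov.unif <- coverage(res.unif)
cov.norm <- coverage(res.norm)
print(c(uniform = cov.unif, normal = cov.norm))
```

For residuals uniform on [-a, a] the RMSE tends to a/sqrt(3), roughly 0.577a, so 2 x RMSE is about 1.155a and lies outside the whole range, which is why the coverage is 100%. For normal residuals centred on 0 the RMSE is the standard deviation, so 2 x RMSE is the familiar two-sigma bound.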
To summarize, you should not use RMSE to make a statement around margin of error because you don’t really know the distribution of the residuals. But if you do proceed down this path it should work fairly well, meaning that ~95% of the residuals will be within 2 x RMSE. And if you have a talented data scientist reporting the RMSE, you can ask her what percentage of the residuals fit into a 2 times RMSE bound. You can even ask her to use the Kolmogorov Smirnov Test (see article from a colleague) to do some validation of the distribution of the residuals compared to a normal or uniform random distribution. I am sure she will be happy to help with this precise request.
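Both requests are a few lines of R. The sketch below uses simulated stand-in residuals (an assumption; real residuals would come from the model) and the base `ks.test` function from the stats package.

```r
set.seed(7)                                # arbitrary seed
res <- rnorm(5000, mean = 0, sd = 0.5)     # stand-in residuals; replace with real ones

rmse <- sqrt(mean(res^2))
share.within <- mean(abs(res) <= 2 * rmse)   # fraction inside the 2 x RMSE bound

# Compare the residuals with a normal distribution of the same mean and sd.
ks.norm <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))

print(share.within)
print(ks.norm$p.value)   # a small p-value suggests the residuals are not normal
```

One caveat: estimating the mean and standard deviation from the same residuals makes the test conservative, so a large p-value is weaker evidence of normality than it looks.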
“Is that all you’ve got to show for seven and a half million years’ work?”
“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
Douglas Adams, The Hitchhiker’s Guide to the Galaxy
I have been to a few presentations over the past weeks where “accuracy” and “RMSE” have been used, with little insight into the utility of an algorithm. In one case there was a meeting where two teams went at it, arguing about their approach to a problem and comparing accuracy results, while I noticed that neither approach actually did anything useful. Or consider the case of the vendor that sold a friend of mine a defect detection solution that was 95% accurate at detecting faults, but the faults only occur 1 in 1000 times in their sampling. It was 95% accurate, and utterly useless to my friend’s company.
So, in the service of increasing the effective communications between data scientists and consumers of their analysis, I thought I would discuss utility of an algorithm in terms of measures that are not always brought up in presentations or reviews.
Accuracy is rarely useful
When it comes to classification problems, it may be true there are data sets in the wild that are nicely balanced, but I never seem to run into them. In a 2-class classification problem, I normally see 1-5% of the data in one class and the rest in the other class (e.g. churning versus non-churning subscribers in a telecommunication network). In the case of unbalanced datasets, accuracy (i.e. number classified correctly / total number of records) is rarely useful. Here is an example as to why.
Consider a classifier that predicts if a mobile network subscriber will churn or not, with a known data set of 1000 subscribers of which 1% are churners. I have included the definitions and values of some different classifier measures.
In the example above, the accuracy is 98.5%, which sounds good as a headline. But let’s dig a little deeper.
From a non-technical perspective, I would offer the following interpretations of these other measures:
“TP Rate” or “Sensitivity” or “Recall”: This is how many of the actual positive cases were found by the algorithm. In this example it is 50%, meaning that half the real churners were found and the other half were missed.
“Precision”: This is how many of the cases predicted as positive are actually positive. In this example it is 33.3%, meaning that there is a 1/3 chance that a subscriber the algorithm predicts to be a churner really is a churner.
The high accuracy value in the example seems a little less relevant in the light of these other measures. While the accuracy is high, the algorithm does not seem to actually be good at finding churners (“recall”) and, potentially worse, 2/3 of the churners it predicts are not churners (“precision”). This could be a significant problem if a company takes expensive actions to retain subscribers, since only 1/3 of that spend will be focused on actual churners.
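The three measures can be reproduced from a confusion matrix. The counts in this R sketch are reconstructed from the figures quoted above (1000 subscribers, 1% churners, 50% recall, 33.3% precision, 98.5% accuracy), so treat them as an assumption rather than the original table.

```r
# Counts reconstructed from the quoted percentages; an assumption, not the source table.
tp <- 5     # churners correctly flagged
fn <- 5     # churners missed
fp <- 10    # non-churners wrongly flagged as churners
tn <- 980   # non-churners correctly left alone
total <- tp + fn + fp + tn

accuracy  <- (tp + tn) / total
recall    <- tp / (tp + fn)     # TP rate / sensitivity
precision <- tp / (tp + fp)

# A "classifier" that never predicts churn gets every non-churner right.
accuracy.never <- (tn + fp) / total

print(c(accuracy = accuracy, recall = recall, precision = precision,
        accuracy.never = accuracy.never))
```

With these counts, predicting that nobody churns scores 99% accuracy while finding no churners at all, which is the core reason accuracy says so little on unbalanced data.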
The good news here is that algorithms can be tuned to focus on different measures, to provide insights more in line with the client’s circumstances. Maybe the rate of successful predictions is important, and false positives or negatives less of an issue. Or too many false positives are a bigger issue than missing a few true positives. This is something that should be discussed between the client and the data scientist, to ensure that the best algorithm for the client is developed – not just the most accurate one.
In the previous blog post (link) I talked about how to interpret places from journey records, proposing a visualization based means to identify key places (“home”, “work”) and then to do a validation using common mapping services like Google. In this post I am going to explore analytic insights on the place information to address questions of the data related to likelihood of journeys and their habitual nature.
Preparing the data
The data set I used for the place frequency analysis until now was basically a list of journeys with start and stop places, with time and date information. A snippet is shown below:
Next I want to understand how often during the sample period the car in question went from one place to another. The data set contains actual journeys, but what is missing is the days and times there were no journeys. For example, if a car went from place 1 to place 2 on Monday 3 times in the data set, I don’t immediately know how habitual that is unless I know how many “Mondays” were present in the sample time-frame.
What I do know at this point is that depending on the car, I have about 1.5 – 2 months of data (between 6 and 8 weeks), which may not be a lot when considering insights into journey habits. My instincts at this point are to first consider coarse-level temporal patterns, like daily events, before moving to more granular analysis, like hourly or time-of-day patterns.
I processed the actual journeys to expand the non-travel dates and also sum the number of journeys between places on each day of the week. You can see a snippet of the file below, focusing on Monday and Friday journeys for a particular car:
What you can see is that (on row 2) on “Mondays” there were a total of 13 journeys from place 3 to place 3 that occurred over 4 days (journey.days), and 2 days (non.journey.days) when there was no travel. If you add the journey.days and non.journey.days features together you can figure out that there was a total of 6 “Mondays” in my sample set (i.e. 6 weeks of data). And you can also see (on row 64) that there were no journeys on Mondays from place 1 to place 1.
After the expanded data set of journeys was built, I created two 3-dimensional matrices:
Total number of trips per day from one place to another (journey.count.x)
Number of days in the sample on which at least 1 trip occurred from one place to another (pred.journey)
The matrix structure for each is:
(journey start place, “from” × journey end place, “to” × day of the week)
(total number of places × total number of places × 7)
I have included the code snippets at the end of the blog to see the necessary transformations to make these matrices from the data frame above.
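As a rough sketch of one way to make that transformation in R (not the code from the original analysis), the data frame columns can be written straight into two arrays with matrix indexing. The toy rows below only echo the column names and a few values mentioned in the post; everything else is assumed.

```r
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# Toy version of the expanded journey data frame; rows are illustrative only.
df.x <- data.frame(
  from             = c(3, 3, 10, 1),
  to               = c(3, 6, 3, 1),
  day              = c("Monday", "Tuesday", "Monday", "Monday"),
  journeys         = c(13, 5, 5, 0),
  journey.days     = c(4, 5, 4, 0),
  non.journey.days = c(2, 1, 2, 6)
)

n.places <- max(df.x$from, df.x$to)

# One cell per (from, to, day of week), initialised to zero
journey.count.x <- array(0, dim = c(n.places, n.places, 7))
pred.journey.x  <- array(0, dim = c(n.places, n.places, 7))

# A 3-column index matrix addresses one array cell per data frame row
idx <- cbind(df.x$from, df.x$to, match(df.x$day, days))
journey.count.x[idx] <- df.x$journeys
pred.journey.x[idx]  <- df.x$journey.days

print(journey.count.x[3, 3, 1])   # journeys from place 3 to 3 on Mondays
print(pred.journey.x[3, 6, 2])    # journey days from place 3 to 6 on Tuesdays
```

The day index here assumes Monday is day 1, which matches the way “dim3” = 1 is read as Monday later in the post.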
So now the fun can begin! Let’s consider some questions related to temporal journey patterns
Where does the car usually go?
This is really two questions hiding inside one:
What journeys does the car take on a regular basis?
What journeys does the car NOT take on a regular basis?
Both of these questions relate to habitual patterns in the data, in contrast to non-habitual patterns (i.e. unpredictable patterns). Both habitual and non-habitual behaviour are interesting to know, since this will allow us to form quantitative statements on the likelihood of journeys on certain days.
Take an example: Assume a particular car has gone from place 3 to place 6 on five Tuesdays out of six. This seems to be a “usual” journey, but how usual is it?
This question can be formulated as a probabilistic hypothesis that we can use the sample data to resolve. Given we have trial information (5 out of 6 occurrences), this sounds like an application of binomial probability, but in this case we don’t know the probability (p) of traveling from place 3 to place 6. However, what we can do is calculate the probability of getting 5/6 occurrences with different p values and examine the likelihood that certain p values are “unlikely”. In this case I have rather arbitrarily set “unlikely” as less than 10%, meaning I want to have a 90% level of likeliness that my p value is correct.
To be a bit more precise, I want to find the value of p (pi) such that I can reject the following (null) hypothesis at 90% level.
Journeys between place 3 and place 6 on Tuesdays follow a binomial distribution with p value pi
Below are the probability values for different values of p and different numbers of journey days out of a maximum of 6.
We can see that for 5 journey days out of 6 (“5T”), the p value (“p”) of 0.50 has a probability of 0.094, and a p value of 0.55 has a probability of 0.136. If the value of pi in the above hypothesis was 0.50, I could reject the hypothesis since there is only a 9.4% chance I can have 5/6 journeys with that p value. In fact, I can reject the hypothesis for all values less than 0.50 as well, because their probabilities are also lower than 10%. What I am left with is the statement that journeys between place 3 and 6 on Tuesdays are “usual”, where the p value is greater than 0.55, or in other words there is a >55% chance that a car will go from place 3 to place 6 on Tuesdays.
What about trips not taken? In our data set, there are lots of non-journeys on days (i.e. days of the week with no journeys). For any location pair on any day with no journeys (0 out of 6, column “0T”), we can see that p values of 0.35 or more can be rejected at the 90% level of likeliness, while smaller values cannot. So, in our case “unusual” means a p value of less than 0.35.
One takeaway from this analysis is that a sample of 6 weeks does not give us a lot of information to make really significant statements on the data. Even with no observed travel between 2 places in a day we can only be confident that the probability of travel between them is less than 35%. If we had 10 weeks of information (see below for a table I worked out with 10 weeks journey info), then the binomial probability of taking no trips over 10 weeks would be less than 0.25 at a 90% confidence level. Similarly, if there was a journey 9 out of 10 weeks we could say the binomial probability would be greater than 0.8 at a 90% confidence level.
A bit unsatisfying, but having some data is better than having no data, and being able to quantify our intuition about “usual” into a probabilistic statement is something.
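The “5T” and “0T” columns can be regenerated with the base R function `dbinom`. The grid of p values in steps of 0.05 is an assumption inferred from the 0.50 and 0.55 values quoted above.

```r
p.grid <- seq(0.05, 0.95, by = 0.05)   # candidate p values (assumed step of 0.05)
weeks  <- 6                            # number of Tuesdays (or any weekday) in the sample
cutoff <- 0.10                         # the "unlikely" threshold

prob.5 <- dbinom(5, size = weeks, prob = p.grid)   # P(exactly 5 journey days out of 6)
prob.0 <- dbinom(0, size = weeks, prob = p.grid)   # P(no journey days out of 6)

print(round(cbind(p = p.grid, "5T" = prob.5, "0T" = prob.0), 3))

min.p.5 <- min(p.grid[prob.5 >= cutoff])   # smallest p on the grid not rejected for 5 of 6
max.p.0 <- max(p.grid[prob.0 >= cutoff])   # largest p on the grid not rejected for 0 of 6
print(c(min.p.5 = min.p.5, max.p.0 = max.p.0))
```

On this grid, p = 0.50 gives about 0.094 for five journey days and is rejected, while p = 0.55 gives about 0.136 and is not. For zero journey days, p = 0.35 gives about 0.075 and is rejected, which is where the “less than 0.35” bound comes from.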
In terms of scripting, R provides some simple ways to extract the relevant car information from our journey count matrices once they have been built, arguably better than the original data frame.
To find all the days and journeys that have 4, 5 or 6 journey days, we can use the syntax:
which(pred.journey.x > 3, arr.ind=TRUE)
The result is shown below, where “dim1” is the row, “dim2” is the column and “dim3” is the day from the original matrix.
For example, we can see that on Mondays (“dim3” = 1) the car had more than 3 journey days out of 6 between places 3 to 3, 10 to 3 and 3 to 10. Similarly, on Tuesdays there were 4 or more journey days between 3 to 6 and 6 to 7.
You can play with the conditional statement in the “which” command to select exact journey values or other conditional queries you can imagine.
You can also dig a little deeper into the journey days for specific locations or days. Here is a command to show all the journey days per day of the week that occurred between location 3 and 6.
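A sketch of both queries on a toy array follows. The array values are made up to echo the examples in the text, and the slicing expression is one plausible form of such a command rather than the original one.

```r
# Toy journey-day array (from x to x day of week); values are illustrative only.
pred.journey.x <- array(0, dim = c(10, 10, 7))
pred.journey.x[3, 3, 1]  <- 4   # place 3 -> 3 on Mondays
pred.journey.x[10, 3, 1] <- 4   # place 10 -> 3 on Mondays
pred.journey.x[3, 6, 2]  <- 5   # place 3 -> 6 on Tuesdays
pred.journey.x[3, 6, 5]  <- 2   # place 3 -> 6 on Fridays

# All (from, to, day) combinations with 4 or more journey days
hits <- which(pred.journey.x > 3, arr.ind = TRUE)
print(hits)

# Journey days from place 3 to place 6 for each day of the week (Monday first)
days.3.to.6 <- pred.journey.x[3, 6, ]
print(days.3.to.6)
```

Leaving the third index empty returns the whole day-of-week vector for that pair of places.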
Are there places that the car goes to frequently?
On examining the data, I could see that for most day and journey pairs, there is 1 or 0 journeys. However, there are a few places and days where there are more frequent journeys. Below is a report (code at bottom to generate it) that shows the journey locations and days for a specific car (“dim1”, “dim2”, “dim3”) that had more than one journey per day, with journey count (“journey.count”), days traveled (“journey.days”) and the ratio of journeys to days (“ratio.v”).
What you can observe is that on Mondays (“dim3” = 1) there are “some regular trips” (with trip probability of between 0.20 and 0.85 based on the fact that there was travel in 4 of 6 weeks) from 3 to 3 (ratio 3.25) and 10 to 3 (ratio 1.25).
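A report of this shape can be sketched by comparing the two arrays cell by cell. The arrays below are toy stand-ins, and the selection rule (more journeys than journey days) is an assumption about how “more journeys per day” was defined.

```r
# Toy arrays; the Monday values echo the ratios quoted above, the rest is made up.
journey.count.x <- array(0, dim = c(10, 10, 7))
pred.journey.x  <- array(0, dim = c(10, 10, 7))
journey.count.x[3, 3, 1]  <- 13; pred.journey.x[3, 3, 1]  <- 4
journey.count.x[10, 3, 1] <- 5;  pred.journey.x[10, 3, 1] <- 4
journey.count.x[3, 6, 2]  <- 5;  pred.journey.x[3, 6, 2]  <- 5

# Cells where the car made more journeys than it had journey days
multi <- which(journey.count.x > pred.journey.x, arr.ind = TRUE)

report <- data.frame(multi,
                     journey.count = journey.count.x[multi],
                     journey.days  = pred.journey.x[multi])
report$ratio.v <- report$journey.count / report$journey.days
print(report)
```

The Tuesday 3 to 6 journey drops out because it happened exactly once on each journey day, giving a ratio of 1.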
While this has been an interesting exercise, the less than satisfying part for me is the lack of profound (even relatively profound) statements that can be made on the data, given the relatively little amount of information available. However, in the land of the blind the one-eyed man is king, so some information is better than none, and the ability to put some precision to our intuition of patterns I think is valuable.
If I had more data, I would have liked to expand the analysis to cover time as well, either hourly or by time of day (e.g. morning, afternoon, evening). The structure of the analysis matrices would be the same, except I would have to add another dimension for time or time of day (thus making a 4-dimensional matrix). Based on the hypothesis testing I have done with the days, I would think much more than 10 weeks of data would be needed to start considering this level of detail.
Happy location playing!
Creating the location and frequency matrices
# Section 1. Load the data from journey file
# this section takes in the dataframes (reads them in first from CSV files)
# select a particular car and day to examine
# > str(df.x)
# 'data.frame': 700 obs. of 6 variables:
# $ from : int 1 1 1 1 1 1 1 1 1 1 …
# $ to : int 1 1 1 1 1 1 1 2 2 2 …
# $ day : chr "Monday" "Tuesday" "Wednesday" "Thursday" …
# $ journeys : int 0 0 0 0 0 0 0 0 1 1 …
# $ journey.days : int 0 0 0 0 0 0 0 0 1 1 …
# $ non.journey.days: int 6 6 6 6 6 6 6 6 5 5 …
# from to day journeys journey.days non.journey.days
In the previous blog post (link) I talked about how to convert location data into place information based on a density-based clustering of the journey start and end locations from a set of cars/vehicles over a two month period. In this post I am going to explore what information we can conclude from the place information.
Being the good data scientists that we all are, the natural thing to reach for when examining some data for the first time is usually a histogram. For both numeric and categorical data a histogram representation usually gives some insights about the spread of the data and the density or frequency with respect to range bins or categories. In the image below, you can see histograms of the place clusters for 3 cars with the y-axis representing the percentage of the journeys that either had a stop or start in a place cluster.
For these cars, we can get some interesting insights from these histograms.
Highest frequency clusters
In all the histograms (and in fact for the majority of the cars I studied) there was one place cluster that had the majority of the starts and stops. For all the cars shown above this is cluster 1. Now if this was my personal car, I would call this place “home”, the place where I live. For service vehicles (like delivery vans) this may be the store or factory. For rental cars, this could be the rental car agency location. Whatever the situation, it is the place where the car ends up most of the time. There are other peaks as well, like cluster 2 for car 1. If this was a residential car and I think about my own behaviour, I would call this place “work”. Of course this analysis is speculative unless there is some additional data we can bring in, which I will talk about later on.
Recall from the previous blog that place cluster “0” represents the locations that were not associated with any other cluster, nor could form themselves into a cluster. Any car that has a lot of cluster “0” journey starts and stops does not spend a lot of time going to the same places. Such cars are not predictable, at least with respect to their journeys.
Compare car 1 and car 2. Car 1 has ~8% of its journeys starting or stopping in non-clustered areas. Car 2 has ~23% outside of clusters. With respect to car 1, car 2 seems less predictable in its driving habits.
Below are two rather pathological examples – cars 19 and 20.
Both of these cars have only 1 cluster (cluster 1), and many or most (in the case of car 20) of their journey starts and stops are in unique locations, or at least more than 550 meters away from each other. Without any further insights I would venture they are taxis or possibly rented cars.
Number of clusters
Cars with more clusters go to more places than cars with fewer clusters, which may seem a rather obvious statement. But another way to consider that information is that a car with fewer clusters is probably more predictable in its journey origins and destinations, since it goes to fewer places. Consider car 1 and car 3 from figure 1. Approximately 75% of car 1’s journeys start or end in cluster 1 or 2. For car 3, the two most common clusters are 1 and 9, but the associated share of starts and ends is only ~48%. If I was looking to find car 1 at any point of time, and without any more information, I would likely find it in place cluster 1 or 2.
Another interesting thing to consider is how the number of clusters relates to the size of cluster “0”. Consider car 2 and car 3 in figure 1; car 2 has more starts and stops outside of common locations than car 3, but car 3 has more clusters (13 versus 9). One could conclude that car 3 “usually” goes to many of the same places, whereas car 2 goes to many different places and less usual places. I have no ground truth here, but car 2 seems more like a taxi whereas car 3 looks more like a delivery van going on a regular delivery route.
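The percentages read off the histograms can also be computed directly. This R sketch uses an invented vector of cluster labels, sized so the shares resemble the ones quoted for car 1; it is not the real car data.

```r
# Hypothetical place-cluster label for each journey start or stop of one car;
# 0 means the location did not fall into any cluster.
cluster <- c(rep(1, 50), rep(2, 25), rep(3, 10), rep(4, 7), rep(0, 8))

tab   <- table(cluster)
share <- setNames(100 * as.vector(tab) / sum(tab), names(tab))   # percentage per cluster

unclustered <- share[["0"]]                       # share outside any cluster
clustered   <- share[names(share) != "0"]
top.two     <- sum(sort(clustered, decreasing = TRUE)[1:2])   # two busiest clusters

print(share)
print(c(unclustered = unclustered, top.two = top.two, n.clusters = length(clustered)))
```

Taken together, the unclustered share, the top-two share and the number of clusters give a compact numeric summary of how predictable a car looks.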
Adding the geographical context
If we add geographical information to the histogram information, we can start to make more inferences about a car’s patterns of journeys. By considering the actual places using a mapping service (like Google maps) and street visualization (like Google Streetview) additional information can be obtained to enrich the place histogram analysis.
Consider the mapping of car 23 places onto Google maps below
For privacy reasons I have removed the actual location of place 1 and place 2, but I can say that place 1 is a residential location and place 2 was an industrial location. Given the nature of the locations and the fact that place cluster 1 has more starts/stops than cluster 2, and the remaining clusters are relatively small, it seems reasonable to call place cluster 1 “home” and cluster 2 “work”.
Powerful insights indeed, but somewhat limited by human analysis effort (i.e. considering the relative cluster sizes, looking at the address in Google, going into Streetview, etc). For 1 or 2 or even 20 cars this is manageable, but as part of an automated process, this approach is a bit lacking. I did not explore it, but the location enrichment I did with Google could possibly be automated via APIs. Nor did I try to make rules for the histogram analysis, so that is possibly something to follow up on in a future project.
In the next blog I go beyond this visualization approach to consider numeric approaches to describe and (hopefully) make some probability statements on journey destinations and origins. The questions I want to get to are:
“If a car is in a particular place at a particular time and day, what is the most likely place it is going to next and when?”
“Is it unusual to have a particular car in this place at this time and day?”
Most of the time when I work with data sets on projects I have location information, either directly reported, like GPS coordinates, or implied, as with phone users’ connections to mobile cell sites. How to incorporate this location data into a model varies a lot by what sort of model we are trying to develop, but I want to share with you one example to show you the sort of insights that can be achieved and how to do it.
In one project we had access to location information from a set of cars. The data set was relatively limited; we only had 1.5-2 months of data from each car. On the other hand the data was rather voluminous; the car data was reported every second including the car’s current location. The data was only reported when the car was turned on (sort of obvious, but worth mentioning), so we could look at the data as being associated with the movement of a car over these 1.5-2 months.
The approach I took to wrangle this data was to consider journeys between places, but not the path taken between places. The rationale I had was that I did not really care which route the driver took to reach a location; I was more interested in knowing where they were going and coming from. In my personal experience I can take different paths on different days to avoid traffic, but I usually end up at the same place, like work, home, shopping for groceries, etc. I was interested to know where the cars were going, and if I could derive some insights about these habitual places.
Step 1: Making Journeys
While in some cases a journey of a thousand miles can begin with a single step, for me they began with creating some internal rules about what constituted a journey. Given a set of time gapped location information, I applied the following rule:
Journeys are separated by a gap of more than 10 minutes in the data records, and a journey must have a travel distance > 500m based on the odometer change.
The logic behind this rule was to avoid cars that redeploy over a short distance (like moving from one side of the street to another to avoid a ticket) or make a short stop in the middle of a journey (like to drop someone off on the way home or stop to fill up for gas). On the other hand, I fully admit the parameters do seem a little arbitrary. Why not 15 minutes? Or 5 minutes? Or greater than 300 meters? Yup, all good points, but I think at the end of the day we are just quibbling about the parameters not the approach. In a perfect world I would probably try a few different parameter ranges to see what sort of convergence would work with the data set. In the rather imperfect world I live in, I had a few days to complete the analysis so I just went with my rule.
At this point then it becomes rather mechanical to create the journeys. As you can see in this R code snippet, the process is:
Calculate the time difference between each observation and store it associated with the most recent observation (i.e. difference_i = time_i – time_(i-1)).
Iterate through the location data, assigning a journey number (starting at 1) until a > 10-minute gap is detected, in which case increment the journey number and assign it to the next point (since it is the start of another journey).
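The two steps above can be sketched in R as follows. This is a vectorised stand-in for the loop described, not the original snippet; the column names, the toy timestamps and the reading of the 500 m rule as a per-journey odometer filter are all assumptions.

```r
# Hypothetical observations for one car: timestamp plus odometer reading in metres.
start <- as.POSIXct("2018-01-01 08:00:00", tz = "UTC")
obs <- data.frame(
  time     = start + c(0, 60, 120, 180, 3600, 3660, 3720, 7200, 7260),
  odometer = c(0, 200, 400, 800, 800, 900, 1000, 1000, 1700)
)

# Step 1: gap in minutes between each observation and the previous one
gap.min <- c(0, diff(as.numeric(obs$time)) / 60)

# Step 2: move to the next journey number whenever the gap exceeds 10 minutes
obs$journey <- 1 + cumsum(gap.min > 10)

# Keep only journeys that covered more than 500 m on the odometer
dist.m <- tapply(obs$odometer, obs$journey, function(x) max(x) - min(x))
kept   <- as.integer(names(dist.m)[dist.m > 500])

print(obs)
print(kept)
```

`cumsum` over the logical gap vector gives the same numbering as incrementing a counter in a loop, which matters when the data arrives once per second for weeks.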
Step 2: Finding places
Now with a list of journeys, the next step is to find the places. The assumption here is that many of the journeys were to and from similar places, even if the absolute location is different. For example, when I go to work, I park in a different parking spot depending on when I arrive, and the spots could be 100-200 meters apart, but fundamentally I am still at “Work”. And when I go home and have to park on the street, I might end up parking in a different location close to my house, but I am still at “Home”.
The approach I took was to consider a density-based clustering of journey start and end locations, to find common places in the journeys. I used the DBSCAN (link) algorithm and had to provide values for the minimum number of locations in a place cluster, as well as an epsilon (i.e. distance measure between locations) to consider these locations as being in the same cluster. My values were 3 points in a cluster, and location points within approximately 550 meters. This value of epsilon generated what (visually) seemed to be a reasonable set of places.
Here is the R code I used:
1) Here the latitude and longitude (lat and lon) of all the journey start and stop locations are formed into a data frame, and the duplicates (if any) are removed. At the end of the process for this car, there were 293 unique locations.
2) Here is the call to DBSCAN (comment error…0.005 epsilon for lat/lon is approximately 550-555 meters (link))
3) Here you can see the cluster results. Cluster 0 is the points that were not clustered, and clusters 1 through 15 represent valid clusters or in my case … places. For example, cluster 1 has 162 locations that are considered to be in the same place – which we would calculate as the mean value of latitude and longitude.
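For readers without the original code, here is a small self-contained R sketch of the same three steps. It uses a hand-rolled, simplified DBSCAN (the helper name `dbscan.simple` is hypothetical) in place of whichever package was used originally, plain Euclidean distance on latitude and longitude in degrees, and invented coordinates.

```r
# Simplified DBSCAN: returns a cluster number per point, 0 for unclustered points.
dbscan.simple <- function(pts, eps, min.pts) {
  d <- as.matrix(dist(pts))              # pairwise distances in degrees
  cluster <- rep(NA_integer_, nrow(pts)) # NA = not visited yet
  cid <- 0L
  for (i in seq_len(nrow(pts))) {
    if (!is.na(cluster[i])) next
    nb <- which(d[i, ] <= eps)           # neighbourhood includes the point itself
    if (length(nb) < min.pts) { cluster[i] <- 0L; next }
    cid <- cid + 1L
    cluster[i] <- cid
    queue <- setdiff(nb, i)
    while (length(queue) > 0) {
      j <- queue[1]; queue <- queue[-1]
      if (!is.na(cluster[j]) && cluster[j] == 0L) cluster[j] <- cid  # noise becomes border
      if (!is.na(cluster[j])) next
      cluster[j] <- cid
      nbj <- which(d[j, ] <= eps)
      if (length(nbj) >= min.pts) {      # core point: keep expanding
        queue <- union(queue, nbj[is.na(cluster[nbj]) | cluster[nbj] == 0L])
      }
    }
  }
  cluster
}

# 1) Unique start/stop locations (invented coordinates)
locs <- unique(data.frame(
  lat = c(45.5000, 45.5010, 45.5005, 45.4995, 45.5500, 45.5510, 45.5505, 45.7000),
  lon = c(-73.6000, -73.6005, -73.6010, -73.5995, -73.6500, -73.6505, -73.6510, -73.9000)
))

# 2) Cluster with the values from the post: eps = 0.005 degrees, 3 points minimum
locs$cluster <- dbscan.simple(locs[, c("lat", "lon")], eps = 0.005, min.pts = 3)

# 3) Cluster sizes, and each place as the mean latitude and longitude of its cluster
print(table(locs$cluster))
centres <- aggregate(cbind(lat, lon) ~ cluster, data = locs[locs$cluster > 0, ], FUN = mean)
print(centres)
```

A degree of latitude is roughly 111 km, so 0.005 degrees is about 555 m north to south. A degree of longitude is shorter away from the equator, so the same epsilon covers less ground east to west.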
For a visual interpretation, you can consider this image where the locations are plotted on a map, and the places are superimposed as coloured circles (colour coded for cluster).
Now we have the places that this car goes, not just its locations. In a future post, I will talk about how to interpret these places, and try to find ways to detect predictability in the car’s journeys.
The other day I happened to notice that Microsoft’s OneDrive software had graciously gone through my photos and tagged them based on what it thought was the content of each photo. Slightly irritated (I did not ask it to tag the photos) I scrolled through the tags to find the following picture of my beloved late dog Hiko:
As some of you are aware, when training pattern recognition neural networks a series of contrasting photos are shown to allow the algorithm to learn what it is seeking. In some cases people use cat and dog images (example here) to build such a detection algorithm. Clearly, Microsoft’s OneDrive algorithm needs some tuning.
When I mentioned this to a colleague, he proceeded to run the same picture of my dog through his own cat/dog deep learning system… and pronounced that it also classified my dog as a cat.
After a few laughs around the office, it struck me that in lieu of some significant ground truth (like, I lived with this dog for 13 years and can vouch for his dog-ness) it would be hard to argue against 2 independent algorithms using the same information to come to the same conclusion. Imagine if some algorithms got together and decided I was prone to criminality. Or maybe that you would be a poor choice for a job. Or as a parent. In these less black and white situations, the independent results of 2 algorithms would be hard to argue against, especially if we don’t know how the decision was arrived at.
In the latest report from New York University’s AI Now Institute (report on Medium here) there are 10 recommendations for improving the equity and responsibility of AI algorithms and their societal applications. These range from limiting use of black box algorithms (like the one used for my dog) to improving the quality of the datasets and trained algorithms, including regular auditing.
For those of you working actively in the AI field, take heed.
In a previous post, I mentioned that we did some analysis to detect IoT devices versus human users. The labelling was based on the International Mobile Equipment Identity (IMEI) Type Allocation Code (TAC). From a database of TAC values there is field per device that specifies the type of device. We allocated devices of type “mobile phone” to be used by humans, and devices of type “M2M” or “module” to IoT devices. We left out “Router”, “Tablet”, “Dongle” and “unknown” since it was not so clear if these were humans or machines. In lieu of some ground truth, this seems like a reasonable approach.
In a dataset of 195,000 unique devices taken from a large mobile network operator, we noticed that the majority of the devices were “mobile phone”, which seems to make sense from our understanding of user distribution. When we created a subset with only devices designed as human or machine (IoT), we ended up with 95% of the sample being human.
The full feature set for our data had 126 different features, with daily observations of device usage over a 12-day period. The insight from this analysis was that machines and humans have different levels for:
Average Revenue Per User (ARPU) level (ordinal ranking from our collection system)
humans higher than machines
Data download (DL) usage
most machines do not have any DL reports over the 12-day period
Internet service usage
most machines do not have any service usage (makes sense since they have no data DL).
As mentioned in the other post, I made a classifier based on ARPU levels and the presence of downloaded data, which was reasonably accurate. But there were significant minorities in each group that act like the other and contributed to the error in the classifier. I named these error groups:
Humanoids: Machines that act like humans (8.31% of the machines). These are devices that download data like users of a mobile phone, and have significant internet service usage.
Cyborgs: Humans that act like machines (10.84% of the humans). These are humans who use mobile devices but never, or very rarely, download data or use internet services: people who use their smart phones to make calls and send SMS but never connect to data.
A little digging into the data yielded some insights about these groups.
On investigation of a few Humanoids, we found that they were modules that could be used in laptop computers or IoT devices. In this case, if these modules have dual use, then it makes sense that devices with these modules could be human or IoT.
In the case of Cyborgs, it was a little less clear because all of them had smart phones, so in theory they should be using data services. However, in another recent investigation with an operator, we found that approximately 18% of the subscribers had no significant data usage, despite having a data plan. It seems our Cyborgs are humans that are non-users of internet technology, much like the 13% of Americans that still don’t use the internet (see my other post). This raises the question, “Why not?”, but I don’t have any answer for that at this time.
The last thing I wanted to mention is the relative utility of using IMEI TAC to identify human versus IoT users in a mobile network. Before seeing these results, most folks in the industry would affirm that IMEI TAC is a good way to tell devices from humans. But because there are dual-use devices for IoT and humans, this is not a very good way to classify. In fact, for the 22 different device types in our sample that were considered “Machine” devices:
60.0% were used by devices that only acted like IoT machines
8.6% were used by devices that only acted like humans
31.4% were used in devices that acted like humans AND IoT devices.
Moral of the story: IMEI TAC does not tell you with accuracy if a device is an IoT device or not. And a lot of humans don’t surf the web on their mobile devices.
Here is a graphic of the relative allocation of humanoid and cyborg device information.
I have been actively working as a professional data scientist for about 12 months now, after leaving my previous career as an innovation/advanced R&D manager, which I held for over 18 years. On a personal level I find it very intellectually stimulating (some old dogs like to learn new tricks). It is also a chance for me to re-acquaint myself with mathematical analysis, something I studied in school and worked on in my first job.
Of course, things have changed a lot over the past 30 years since I got my degree in math, and it is very nice to see the power of tools like R and Python to easily chunk through numbers and create fantastic visualizations. But old habits die hard, and I like to get up close and personal with the numbers, compared to some of my colleagues who readily throw the datasets at different algorithms to extract some insights.
In a recent project, three of us approached a similar problem, but from three different technological angles. The problem was to find a classifier algorithm to distinguish IoT devices (“machines”) from human subscribers (“humans”) in a set of network data provided by a mobile operator. The data set was very unbalanced, with about 5% of the data considered machines, and pretty messy, as per usual.
My approach was to examine the data, using summary measurements and visualization. And based on that, I was able to craft my own classifier out of some simple rules derived from the data analysis.
Basic Data Examination
I work in R and there are a few “go to” functions for this work:
str(): Structure of an object (link). This function can be run on any R object. It produces information about the type(s) in the object and gives a preview of a few observation values:
head()/tail(): Returns the first or last parts of a vector, matrix, table, data frame or function (link). Good to see some examples of features in a data frame. If you have a lot of features, you can use the transpose function (t()) to put the features into rows.
summary(): A generic function used to produce result summaries (a bit circular…) (link): A nice way to examine numeric data to get some ideas on the spread and look of the data. The results include the quartile information, max/min and mean/median.
hist(): Histogram (link). The bread and butter function of data science examination, a histogram. Commonly used for vectors of numbers, it provides frequency information on the data across a number of bins. But don’t try this on a specific non-numeric field from a data frame, or you will get an error:
hist.default(df.sample.output$sli_neg_impact_svc_top_eea): ‘x’ must be numeric
table(): A contingency table of cross-classifying factor counts (link). This function looks at the frequency of feature values for categorical data. But be warned: the default is *not* to show NAs. This can lead to some misunderstanding of the data. Don’t forget the argument useNA = “ifany”.
Not so Basic Data Examination
The functions above come ‘out of the box’ with R. If you poke around the internet, you will find other functions that can summarize data frames in different ways.
A function I recently discovered from the package “Hmisc” is describe(). This function is cool! It takes a data frame and digs into it to provide a well-rounded summary.
It identifies the number of features and observations, and then for each feature provides some summary statistics based on type.
n: number of observations not NA
missing: number of observations that are NA
distinct: number of different values of the feature across the observations
(numeric) Info: How much information can be obtained from the feature (see description for exact details). A feature that has a wide range of values that are not tied in the sample has a higher number than a feature that only has 1 or 2 values that are widely shared in the observations. For example, the feature below has only 1 value in all the observations, or NA. This is pretty useless for understanding what makes some observations different from others.
(numeric) Gmd (Gini mean difference): Also known as the mean absolute difference. Like it says on the label, it is the mean of the absolute differences between all pairs of observations. It is a way to assess the ‘spread’ of the data: wider spread means more data variability, less spread means less variability. Unlike standard deviation, it is not a measure of spread from the (supposed) central measure of the mean.
(numeric) 5, 10, 25, 50, 75, 90, 95 percentiles
(continuous data) Lowest/highest 5 values in the data
(non-numeric, discrete values or categorical factors) frequency & probability of each value in the observations
Below are some examples…of
Continuous numeric feature
Discrete numeric feature
A feature with little information to show (has only 1 value in the observations)
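To make the Gmd measure from the list above concrete, here is a minimal sketch of the pairwise calculation. It is written in Python rather than R, the function name and the two toy vectors are made up for illustration, and it is not the code Hmisc runs internally.

```python
from itertools import combinations


def gini_mean_difference(values):
    """Mean absolute difference over all distinct pairs of observations."""
    pairs = list(combinations(values, 2))
    if not pairs:
        raise ValueError("need at least two observations")
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Two made-up features: one tightly clustered, one widely spread.
tight = [10, 11, 10, 12, 11]
wide = [1, 25, 3, 40, 12]
print(gini_mean_difference(tight), gini_mean_difference(wide))
```

The widely spread vector gives the larger value, and a feature with a single repeated value gives zero, which lines up with the “little information” example above.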
Data Examination of the Machine Human Dataset
The first thing I did was create a statistically significant sample from the overall data set of 194,602 unique devices. The sample set is based on a 99% confidence level +/- 3% on the machines, and a proportional number of humans as in the actual data set (i.e. 5% machines, 95% humans). Our data set is pretty rich, capturing 125 features related to network usage. The data was collected over a 2-week period, but I chose to consider a single-day view of the data to see whether one day of information was enough to determine if a device was used by a human or a machine.
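For readers who want to see where a sample size like this can come from, here is a sketch using Cochran’s formula for a proportion with a finite population correction. The post does not say which formula was actually used, so the formula, the conservative proportion of 0.5 and the function name are all assumptions; the sketch is in Python rather than R.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.99, margin=0.03, proportion=0.5):
    """Cochran's sample size for a proportion, with finite population correction."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


total_devices = 194_602
machine_share = 0.05  # approximate share quoted in the post
machines = round(total_devices * machine_share)

n_machines = sample_size(machines)
# Keep the 5% / 95% split of the full data set in the sample.
n_humans = round(n_machines * (1 - machine_share) / machine_share)
print(n_machines, n_humans)
```

Under these assumptions the machine sample comes out in the range of 1,500 to 1,600 devices, with 19 humans sampled for every machine.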
However, when I started looking at the data, there were a lot of NA fields in the observations. It is pretty standard practice to remove observations with lots of NAs, or put them to some null value (like ‘0’), but in this case the absence of information was in fact information.
Consider the figure below, where I ordered (by machine data) the percentage of rows that had NA for a feature.
The top 5 features include only 2 usage features, “arpu_grp_eea” and “data_dl_dy_avg_14d” — the rest are either identity related (“id”, “type”, “imeitac_last_eea”) or date related (“date_1d”).
There is a clear break starting right at “data_dl_dy_avg_14d”. At this feature, we see humans still have data (around 86%) but only 8% of machines have valid information. And it gets worse going down from there. Now with a little application insight I can tell you that the remaining features are based on having data downloaded to a device, so the fact that they are less present when a device does not download data is not a big surprise. In fact, you could even reach the conclusion that IF a device was downloading data, THEN there is a strong probability that it is a human using the device. So the absence of information in this case is itself informative.
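The IF/THEN reasoning above can be checked with Bayes’ rule. This is a back-of-the-envelope sketch in Python, using the rounded percentages quoted in this post (86% and 8% with download data, a 95%/5% human/machine split); the variable names are mine and the inputs are approximations, so treat the outputs the same way.

```python
# Rough figures taken from the post; treat them as approximations.
p_human = 0.95              # share of humans in the labelled data
p_machine = 0.05            # share of machines
p_data_given_human = 0.86   # humans with a valid "data_dl_dy_avg_14d" value
p_data_given_machine = 0.08

# P(human | download data present)
p_data = p_data_given_human * p_human + p_data_given_machine * p_machine
p_human_given_data = p_data_given_human * p_human / p_data

# P(machine | download data absent)
p_no_data = 1 - p_data
p_machine_given_no_data = (1 - p_data_given_machine) * p_machine / p_no_data

print(f"P(human | data present) = {p_human_given_data:.3f}")
print(f"P(machine | data absent) = {p_machine_given_no_data:.3f}")
```

With these inputs, a device that downloads data is almost certainly human (above 99%), while a device with no download data is a machine only about a quarter of the time, because humans are so much more numerous. That asymmetry is why the missing download data cannot do the job on its own.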
Next, I examined the feature “arpu_grp_eea”, which was present in most machines and humans. Consider the histogram below of human and machine “arpu_grp_eea” levels.
The humans tend to have higher levels, and machine lower levels.
Based on only these observations, I derived a simple (or simplistic!) classifier rule:
Machines are devices with:
ARPU level ≤ 2 &
no data download activity.
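Written as code, the rule above is a two-condition test. This is a sketch in Python (the actual work was done in R); the function name, the treatment of missing or zero download volume as “no activity”, and the handling of a missing ARPU level are my assumptions, not details from the data set.

```python
import math
from typing import Optional


def _missing(value: Optional[float]) -> bool:
    """True for None or NaN, the two usual encodings of NA."""
    return value is None or math.isnan(value)


def is_machine(arpu_level: Optional[float], data_dl: Optional[float]) -> bool:
    """Apply the rule: ARPU level <= 2 and no data download activity."""
    # Assumption: a missing or zero download volume counts as "no activity".
    no_download = _missing(data_dl) or data_dl == 0
    # Assumption: a device without an ARPU level is not called a machine.
    low_arpu = (not _missing(arpu_level)) and arpu_level <= 2
    return low_arpu and no_download


print(is_machine(1, None))    # low ARPU, no download data
print(is_machine(4, None))    # high ARPU, no download data
print(is_machine(1, 350.0))   # low ARPU, downloads data
```

Only the first example is classified as a machine; either a higher ARPU level or any download activity is enough to call the device human.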
I took this rule out for a test drive on a larger version of the data set (194,602 records), which is pretty unbalanced (5.1% machines, 94.9% humans). In this case, prediction failures (as percentages) are not evenly weighted: a 1% misprediction of humans as machines will misclassify 1847 devices, whereas a 1% failure rate of machines as humans will misclassify 99 devices. A good classifier should have errors that have a similar effect in each class, measured not as ratios but as absolute numbers of misclassified devices.
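The arithmetic behind those two numbers is short enough to show. This is a plain Python sketch using only the figures quoted above; the variable names are mine.

```python
total = 194_602
human_share, machine_share = 0.949, 0.051

humans = total * human_share
machines = total * machine_share

# Devices misclassified by a 1% error rate in each class.
human_errors = round(humans * 0.01)
machine_errors = round(machines * 0.01)
print(human_errors, machine_errors)  # 1847 99
```

The same 1% error rate costs almost 19 times more devices on the human side, which is why I look at absolute errors and not just rates.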
And here are the results:
You can also see the Confusion Matrix and the 4-fold graph here
My two other colleagues approached this problem in a different way, with a different perspective on the data set. They got more accurate results, but they also chose a different perspective on the problem. Please take a look at Marc-Olivier’s and Pascal’s posts on how they approached this.
Challenges to innovation acceptance within an organization
I have worked within a large international technology company for many years and collaborated with other folks, within and outside the company, on product innovation. While there are many differences in products and cultures (corporate and societal), most of the people I met agree that internal innovation is a hard sell.
When I started out championing internal innovation 20 years ago I was naïve enough to feel that innovation would be welcome. Sure, some people in the company would have their noses out of joint (probably those that did not think of the idea or whose products would be affected by the changes), but management and far-seeing people would see the wisdom and support these projects. And for all projects there *was* support from far-seeing managers to get the innovation off the ground (sort of like internal angel investors), but rather than receiving roses at the end of the project it was always rocks and sticks. I can be pretty stubborn by nature (reveal: my sport is long distance running) but after a few years the rejection was getting me down.
So I examined what experts had to say about innovation, to better understand the situation I was in.
Here is an insight on the problem from an old management consultant:
“It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to management than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institution and merely lukewarm defenders in those who gain by the new ones.”
The Prince, Niccolò Machiavelli
And here are some other words of advice from the man that is said to have invented modern management
“It is not size that is an impediment to entrepreneurship and innovation; it is the existing operation itself, and especially the existing successful operation.”
“Operating anything …. requires constant effort and unremitting attention. The only thing that can be guaranteed in any kind of operation is the daily crisis. The daily crisis cannot be postponed, it has to be dealt with right away. And the existing operation demands high priority and deserves it.
The new always looks so small, so puny, so unpromising next to the size and performance of maturity. … The odds are heavily stacked against it succeeding.”
Innovation and Entrepreneurship, Peter Drucker
Clearly I should have been expecting the bricks all the time.
But after some further reflection, there are some specific reasons why internal innovation in large companies is not welcome.
Negates investments in physical & personal capital
Silicon Valley’s hippy values ‘killing music industry’,
Paul McGuinness, U2 Manager (Guardian.co.uk, January 2008)
Experts like to stay experts since the perks that come with the job are quite good (salary, prestige). Someone coming with a new way to do something that removes the need for expertise – well they are not welcome. Same is true for product managers – they don’t want to hear that their cash cow is dead with this new innovation. Or the new servers we bought are now obsolete. Shooting the messenger is de rigueur in this situation.
Upsets the existing hierarchy and power base
“When Henry Ford made cheap, reliable cars people said, ‘Nah, what’s wrong with a horse?’ That was a huge bet he made, and it worked.”
Mainframes vs PCs; CD/DVDs vs streaming; electric cars vs oil cars. There are going to be winners and losers, and innovation is the catalyst for change.
It is a drain on resources
A company always has a crisis going on (usually several) and it needs all the good, talented people to help resolve the issues. The last thing a manager handling the crisis wants to hear is that there is a skunkworks project that is sucking up the people and resources they could use to solve the crisis. And even if the project is “ring fenced” to prevent poaching, there will be constant comments in the management meetings that the company does not have the right “focus” on the crisis, since “our best people are not engaged”.
Is often an instrument of change
“People hate change because people hate change”
Tom DeMarco, Peopleware
It is sort of in the nature of innovation to introduce change. And instinctively we resist change, for good and bad reasons. Even changes that bring longer term benefit to most people have their resistors, even long after the change has been accomplished. As of 2016, 13% of Americans do not use the internet. And that level has not changed for 3 years. And now vinyl records are making a comeback, which will please the folks that never embraced CDs.
And besides these reasons, all the innovator has to offer is (typically) an idea and possibly a prototype with limited ability to make money for the organization in the next quarter. Sort of like shark testing a baby lamb when you think about it.
But all is not lost. In subsequent posts I will talk about some strategies to overcome the resistance and get an innovation to market within a large company. Just don’t expect any flowers along the way.