I have been working as a professional data scientist for about 12 months now, after leaving the career in innovation/advanced R&D management that I held for over 18 years. On a personal level I find it very intellectually stimulating (some old dogs like to learn new tricks), and it is a chance for me to re-acquaint myself with mathematical analysis, something I studied in school and worked on in my first job.
Of course, things have changed a lot in the 30 years since I got my degree in math, and it is very nice to see the power of tools like R and Python to easily chunk through numbers and create fantastic visualizations. But old habits die hard, and I like to get up close and personal with the numbers, compared to some of my colleagues who readily throw datasets at different algorithms to extract insights.
In a recent project, three of us approached a similar problem from three different technological angles. The problem was to find a classifier algorithm to distinguish IoT devices (“machines”) from human subscribers (“humans”) in a set of network data provided by a mobile operator. The data set was very unbalanced, with about 5% of the data considered machines, and pretty messy – as per usual.
My approach was to examine the data, using summary measurements and visualization. And based on that, I was able to craft my own classifier out of some simple rules derived from the data analysis.
Basic Data Examination
I work in R and there are a few “go to” functions for this work:
str(): Structure of an object (link). This function can be run on any R object; it produces information about the type(s) in the object and gives a preview of a few observation values:
head()/tail(): Returns the first or last parts of a vector, matrix, table, data frame or function (link). Good to see some examples of features in a data frame. If you have a lot of features, you can use the transpose function (t()) to put the features into rows.
summary(): A generic function used to produce result summaries (a bit circular…) (link): A nice way to examine numeric data to get some ideas on the spread and look of the data. The results include the quartile information, max/min and mean/median.
hist(): Histogram (link). The bread and butter function of data science examination, a histogram. Commonly used for vectors of numbers, it provides frequency information on the data across a number of bins. But don’t try this on a specific non-numeric field from a data frame, or you will get an error:
Error in hist.default(df.sample.output$sli_neg_impact_svc_top_eea): ‘x’ must be numeric
table(): A contingency table of the counts of cross-classifying factors (link). This function looks at the frequency of feature values for categorical data. But be warned: the default is *not* to show NAs. This can lead to some misunderstanding of the data. Don’t forget the qualifier useNA = “ifany”.
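As a quick sketch, here is how these functions behave together on a small toy data frame (the data and column names are made up for illustration):

```r
# Toy data frame to exercise the basic examination functions
df <- data.frame(
  usage = c(10.5, 2.3, NA, 7.7, 0.9, 15.2),
  type  = c("human", "machine", "human", "human", NA, "machine")
)

str(df)          # types and a preview of the first values
t(head(df, 3))   # first rows, transposed so features become rows
summary(df)      # quartiles, min/max, mean, and NA counts for numerics

# table() silently drops NAs by default...
table(df$type)                    # human: 3, machine: 2
# ...so ask for them explicitly
table(df$type, useNA = "ifany")   # human: 3, machine: 2, <NA>: 1

hist(df$usage)   # fine: a numeric vector
# hist(df$type)  # would fail: 'x' must be numeric
```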
Not so Basic Data Examination
The functions above come ‘out of the box’ with R. If you poke around the internet, you will find other functions that can summarize data frames in different ways.
A function I recently discovered from the package “Hmisc” is describe(). This function is cool! It takes a data frame and digs into it to provide a well-rounded summary.
It identifies the number of features and observations, and then for each feature provides some summary statistics based on its type:
- n: number of observations not NA
- missing: number of observations that are NA
- distinct: number of different values of the feature across the observations
- (numeric) Info: how much information can be obtained from the feature (see the documentation for exact details). A feature with a wide range of values that are not tied in the sample has a higher number than a feature with only 1 or 2 values that are widely shared across the observations. For example, the feature below has only 1 value in all the observations, or NA. This is pretty useless for understanding what makes some observations different from others.
- (numeric) Gmd (Gini mean difference): also known as Mean Absolute Difference (MAD). Like it says on the label, it is the mean of the absolute differences between pairs of values, and a way to assess the ‘spread’ of the data: wider spread – more variability, less spread – less variability. Unlike standard deviation, it does not measure spread relative to a (supposed) central value such as the mean.
- (numeric) Mean
- (numeric) 5,10,25,50,75,95 percentiles
- (continuous data) Lowest/highest 5 values in the data
- (non-numeric, discrete values or categorical factors) frequency & probability of each value in the observations
Below are some examples:
Continuous numeric feature
Discrete numeric feature
A feature with little information to show (has only 1 value in the observations)
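A minimal sketch of describe() in action on a toy data frame (this assumes the Hmisc package is installed; the data and column names are made up for illustration):

```r
library(Hmisc)  # install.packages("Hmisc") if needed

df <- data.frame(
  dl_volume = c(120.5, 0.3, NA, 88.1, 7.7, 230.9, 15.2, NA),
  dev_class = factor(c("human", "machine", "human", "human",
                       "machine", "human", NA, "human"))
)

# For each feature: n, missing, distinct, plus type-appropriate
# statistics (Info, Gmd, mean, percentiles for numerics;
# frequencies and proportions for factors)
describe(df)
```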
Data Examination of the Machine Human Dataset
The first thing I did was create a statistically significant sample from the overall data set of 194,602 unique devices. The sample is based on a 99% confidence level +/- 3% on the machines, with humans in the same proportion as in the actual data set (i.e. 5% machines, 95% humans). Our data set is pretty rich, capturing 125 features related to network usage. The data was collected over a 2-week period, but I chose to consider a single-day view, to see if I could determine from only 1 day of information whether a device was used by a human or a machine.
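For reference, the usual sample-size formula gives roughly this picture (a sketch under standard assumptions: worst-case proportion p = 0.5 and a finite-population correction; the exact method used for the project may differ):

```r
# Required sample size at 99% confidence, +/- 3% margin of error
z  <- qnorm(0.995)        # two-sided 99% -> ~2.576
e  <- 0.03                # margin of error
n0 <- z^2 * 0.25 / e^2    # infinite-population size, worst case p = 0.5

# Finite-population correction against the ~5% machine population
N <- round(194602 * 0.05)
n <- n0 / (1 + (n0 - 1) / N)

round(n0)  # ~1843 before the correction
round(n)   # ~1550 after the correction
```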
However, when I started looking at the data, there were a lot of NA fields in the observations. It is pretty standard practice to remove observations with lots of NAs, or to set them to some null value (like ‘0’), but in this case the absence of information was in fact information.
Consider the figure below, where I ordered (by machine data) the percentage of rows that had NA for a feature.
The top 5 features include only 2 usage features, “arpu_grp_eea” and “data_dl_dy_avg_14d” — the rest are either identity related (“id”, “type”, “imeitac_last_eea”) or date related (“date_1d”).
There is a clear break starting right at “data_dl_dy_avg_14d”. At this feature, we see humans still have data (around 86%) but only 8% of machines have valid information. And it gets worse going down from there. Now, with a little application insight, I can tell you that the remaining features are based on having data downloaded to a device, so the fact that they are less present when a device does not download data is not a big surprise. In fact, you could even reach the conclusion that IF a device is downloading data, THEN there is a strong probability that it is a human using the device. So the absence of information in this case tells us something.
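This kind of per-class NA profile can be computed directly; here is a minimal sketch on made-up data (the column and class names are illustrative, not the real features):

```r
# Percentage of NA per feature, split by device class
df <- data.frame(
  dev_class      = c("human", "human", "human", "machine", "machine"),
  data_dl_dy_avg = c(512.3, NA, 88.0, NA, NA),
  arpu_grp       = c(4, 3, NA, 1, 2)
)

na_pct_by_class <- function(d, class_col) {
  feats <- setdiff(names(d), class_col)
  # one column of NA percentages per class, one row per feature
  sapply(split(d[feats], d[[class_col]]),
         function(g) round(colMeans(is.na(g)) * 100, 1))
}

na_pct_by_class(df, "dev_class")
# machines here are 100% NA on the download feature, humans 33.3%
```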
Next, I examined the feature “arpu_grp_eea”, which was present in most machines and humans. Consider the histogram below of human and machine “arpu_grp_eea” levels.
The humans tend to have higher levels, and the machines lower levels.
Based on only these observations, I derived a simple (or simplistic!) classifier rule:
Machines are devices with:
- ARPU level ≤ 2 &
- no data download activity.
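The rule is simple enough to express directly in R; here is a sketch with assumed column names (arpu_grp and data_dl are placeholders standing in for the real features):

```r
# Two-rule classifier: low ARPU AND no data download activity => machine
# (column names here are placeholders, not the real feature names)
classify_device <- function(arpu_grp, data_dl) {
  ifelse(!is.na(arpu_grp) & arpu_grp <= 2 &
           (is.na(data_dl) | data_dl == 0),
         "machine", "human")
}

classify_device(arpu_grp = c(1, 4, 2),
                data_dl  = c(NA, 512.3, 0))
# [1] "machine" "human"   "machine"
```

Treating NA download volume the same as zero captures the earlier observation that missing download features are themselves a machine signal.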
I took this rule out for a test drive on a larger version of the data set (194,602 records), which is pretty unbalanced (5.1% machines, 94.9% humans). In this case, prediction failure (as a percentage) is not evenly weighted: mispredicting 1% of humans as machines will misclassify 1,847 devices, whereas mispredicting 1% of machines as humans will misclassify 99 devices. A good classifier should have errors with a similar effect – not in ratios but in absolute numbers.
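The arithmetic behind those two numbers:

```r
# Absolute impact of a 1% error rate on each class in the unbalanced set
n_total    <- 194602
n_machines <- round(n_total * 0.051)  # ~9,925 machines
n_humans   <- n_total - n_machines    # ~184,677 humans

round(n_humans * 0.01)    # 1% of humans mislabelled: ~1847 devices
round(n_machines * 0.01)  # 1% of machines mislabelled: ~99 devices
```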
And here are the results:
You can also see the confusion matrix and the 4-fold graph here.
My two other colleagues approached this problem in a different way, with a different perspective on the data set. They got more accurate results, but they also framed the problem differently. Please take a look at Marc-Olivier’s and Pascal’s posts on how they approached it.