Month: October 2017

This is not a cat

The other day I happened to notice that Microsoft’s OneDrive software had graciously gone through my photos and tagged them based on what it thought was the content of each photo.  Slightly irritated (I did not ask it to tag the photos), I scrolled through the tags to find the following picture of my beloved late dog Hiko:


As some of you are aware, when training pattern recognition neural networks, a series of contrasting photos is shown to allow the algorithm to learn what it is seeking.  In some cases people use cat and dog images (example here) to build such a detection algorithm.  Clearly, Microsoft’s OneDrive algorithm needs some tuning.

When I mentioned this to a colleague, he proceeded to run the same picture of my dog through his own cat/dog deep learning system… and pronounced that it also classified my dog as a cat.

After a few laughs around the office, it struck me that in the absence of some significant ground truth (I lived with this dog for 13 years and can vouch for his dog-ness) it would be hard to argue against two independent algorithms using the same information and coming to the same conclusion.  Imagine if some algorithms got together and decided I was prone to criminality.  Or that you would be a poor choice for a job.  Or as a parent.  In these less black-and-white situations, the independent results of two algorithms would be hard to argue against, especially if we don’t know how the decision was arrived at.

In the latest report from New York University’s AI Now Institute (report on Medium here), there are 10 recommendations for improving the equity and responsibility of AI algorithms and their societal applications.  These range from limiting the use of black box algorithms (like the one used for my dog) to improving the quality of datasets and trained algorithms, including regular auditing.

For those of you working actively in the AI field, take heed.


On Humanoids and Androids


In a previous post, I mentioned that we did some analysis to distinguish IoT devices from human users.  The labelling was based on the International Mobile Equipment Identity (IMEI) Type Allocation Code (TAC).  In a database of TAC values there is a field per device that specifies the type of device.  We labelled devices of type “mobile phone” as used by humans, and devices of type “M2M” or “module” as IoT devices.  We left out “Router”, “Tablet”, “Dongle” and “unknown” since it was not so clear whether these were humans or machines.  In the absence of ground truth, this seems like a reasonable approach.
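To make the labelling concrete, here is a minimal sketch in Python. The TAC values and the lookup table are invented for illustration; the real TAC database has many more entries and fields.

```python
# Illustrative TAC -> device-type lookup; these values are made up,
# not entries from a real TAC database.
TAC_TYPES = {
    "35853901": "mobile phone",
    "86365804": "M2M",
    "35412207": "module",
    "49015420": "tablet",
}

def label_device(tac: str) -> str:
    """Apply the labelling rule described above: mobile phones are
    humans, M2M/modules are machines, everything else is left out."""
    device_type = TAC_TYPES.get(tac, "unknown").lower()
    if device_type == "mobile phone":
        return "human"
    if device_type in ("m2m", "module"):
        return "machine"
    return "unknown"  # routers, tablets, dongles, unknowns excluded

print(label_device("35853901"))  # human
```

The real work, of course, is in the quality of the TAC database itself, which is where the trouble starts later in this post.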

In a dataset of 195,000 unique devices taken from a large mobile network operator, we noticed that the majority of the devices were “mobile phone”, which makes sense from our understanding of user distribution.  When we created a subset with only devices designated as human or machine (IoT), we ended up with 95% of the sample being human.

distribution of all device types in the data set
distribution of human (“mobile phone”) versus IoT (“machine”) devices

The full feature set for our data had 126 different features, with daily observations of device usage over a 12-day period.   The insights from this analysis were that machines and humans have different levels of:

Average Revenue Per User (ARPU) level (ordinal ranking from our collection system) 

  • humans higher than machines 

Data download (DL) usage 

  • most machines do not have any DL reports over the 12-day period 

Internet service usage 

  • most machines do not have any service usage (which makes sense, since they have no data DL). 

As mentioned in the other post, I made a classifier based on ARPU levels and the presence of downloaded data, which was reasonably accurate.  But there were significant minorities in each group that behave like the other, and these contributed to the error in the classifier.  I named these error groups: 

  • Humanoids: Machines that act like humans (8.31% of the machines).  These are devices that download data like users of a mobile phone and have significant internet service usage. 
  • Cyborgs: Humans that act like machines (10.84% of the humans).  These are people using mobile devices who never or very rarely download data or use internet services; people who use their smartphones to make calls and send SMS but never connect to data. 
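The two error groups amount to a simple tagging rule. Here is a sketch in Python; the "acts human" test (any data download or internet service usage) is my simplification of the behaviour described above, and the field names are illustrative.

```python
def error_group(label: str, has_data_dl: bool, has_svc_usage: bool) -> str:
    """Tag the mismatch groups named in the post: Humanoids are
    machines that act like humans, Cyborgs are humans that act
    like machines."""
    acts_human = has_data_dl or has_svc_usage
    if label == "machine" and acts_human:
        return "humanoid"
    if label == "human" and not acts_human:
        return "cyborg"
    return "typical"

print(error_group("human", False, False))  # cyborg
```

Any rule classifier built on downloads and service usage will, by construction, misclassify exactly these two groups.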

A little digging into the data yielded some insights about these groups.   

On investigation of a few Humanoids, we found that they were modules that could be used in laptop computers or IoT devices.  If these modules have dual use, then it makes sense that devices carrying them could be human or IoT.  

In the case of Cyborgs, it was a little less clear, because all of them had smartphones, so in theory they should be using data services.  However, in another recent investigation with an operator, we found that approximately 18% of the subscribers had no significant data usage, despite having a data plan.  It seems our Cyborgs are humans that are non-users of internet technology, much like the 13% of Americans that still don’t use the internet (see my other post).  This begs the question, “Why not?”, but I don’t have an answer for that at this time. 

The last thing I wanted to mention is the relative utility of using IMEI TAC to identify human versus IoT users in a mobile network.  Before we presented these results, most folks in the industry would have affirmed that IMEI TAC is a good way to distinguish devices from humans.  But because there are dual-use devices for IoT and humans, it is not a very good way to classify.  In fact, for the 22 different device types in our sample that were considered “Machine” devices: 

  • 60.0% were used by devices that only acted like IoT machines 
  • 8.6% were used by devices that only acted like humans 
  • 31.4% were used in devices that acted like humans AND IoT devices. 

Moral of the story: IMEI TAC does not tell you with accuracy if a device is an IoT device or not.  And a lot of humans don’t surf the web on their mobile devices. 

Here is a graphic of the relative allocation of humanoid and cyborg device information. 


Data Examination and Home Made Classifiers

I have been actively working as a professional data scientist for about 12 months now, after leaving my previous career as an innovation/advanced R&D manager, a role I held for over 18 years.  On a personal level I find it very intellectually stimulating (some old dogs like to learn new tricks).  It is also a chance for me to re-acquaint myself with mathematical analysis, something I studied in school and worked on in my first job.

Of course, things have changed a lot over the 30 years since I got my degree in math, and it is very nice to see the power of tools like R and Python to easily chunk through numbers and create fantastic visualizations.  But old habits die hard, and I like to get up close and personal with the numbers, compared to some of my colleagues who readily throw the datasets at different algorithms to extract some insights.

In a recent project, three of us approached a similar problem from three different technological angles. The problem was to find a classifier algorithm to distinguish IoT devices (“machines”) from human subscribers (“humans”) in a set of network data provided by a mobile operator. The data set was very unbalanced, with about 5% of the data considered machines, and pretty messy, as per usual.

My approach was to examine the data using summary measurements and visualization.  Based on that, I was able to craft my own classifier out of some simple rules derived from the data analysis.

Basic Data Examination

I work in R and there are a few “go to” functions for this work:

str(): Structure of an object (link).  This function can be run on any R object; it produces information about the type(s) in the object and gives a preview of a few observation values:

head()/tail(): Returns the first or last parts of a vector, matrix, table, data frame or function (link).  Good to see some examples of features in a data frame.  If you have a lot of features, you can use the transpose function (t()) to put the features into rows.

summary(): A generic function used to produce result summaries (a bit circular…) (link).  A nice way to examine numeric data to get some idea of the spread and look of the data.  The results include quartile information, max/min and mean/median.

hist(): Histogram (link). The bread-and-butter function of data science examination. Commonly used for vectors of numbers, it provides frequency information on the data across a number of bins.  But don’t try this on a non-numeric field from a data frame, or you will get an error:

Error in hist.default(df.sample.output$sli_neg_impact_svc_top_eea): ‘x’ must be numeric

table(): A contingency table of the counts of cross-classifying factors (link).  This function looks at the frequency of feature values for categorical data.  But be warned: the default is *not* to show NAs, which can lead to some misunderstanding of the data.  Don’t forget the qualifier useNA = “ifany”.

Not so Basic Data Examination

The functions above come ‘out of the box’ with R.  If you poke around the internet, you will find other functions that can summarize data frames in different ways.

A function I recently discovered from the package “Hmisc” is describe().  This function is cool!  It takes a data frame and digs into it to provide a well-rounded summary.

It identifies the number of features and observations, and then for each feature provides some summary statistics based on its type.

  • n: number of observations not NA
  • missing: number of observation that are NA
  • distinct: number of different values of the feature across the observations
  • (numeric) Info: How much information can be obtained from the feature (see description for exact details).  A feature with a wide range of values that are not tied in the sample has a higher number than a feature with only 1 or 2 values widely shared across the observations.  For example, the feature below has only 1 value in all the observations, or NA.  This is pretty useless for understanding what makes some observations different from others.
  • (numeric) Gmd (Gini mean difference):  Also known as the mean absolute difference.  Like it says on the label, it is the mean of the absolute differences between pairs of values in the data.  A way to assess the ‘spread’ of the data: wider spread means more variability, less spread means less.  Unlike standard deviation, it is not a measure of spread around the (supposed) central measure, the mean.
  • (numeric) Mean
  • (numeric) 5,10,25,50,75,95 percentiles
  • (continuous data) Lowest/highest 5 values in the data
  • (non-numeric, discrete values or categorical factors) frequency & probability of each value in the observations

Below are some examples:

Continuous numeric feature

Discrete numeric feature

Non-numeric feature

A feature with little information to show (has only 1 value in the observations)

Data Examination of the Machine Human Dataset

The first thing I did was create a statistically significant sample from the overall data set of 194,602 unique devices.  The sample is based on a 99% confidence level +/- 3% on the machines, with humans in the same proportion as in the actual data set (i.e. 5% machines, 95% humans).   Our data set is pretty rich, capturing 125 features related to network usage.  The data was collected over a 2-week period, but I chose to consider a single-day view of the data to see if, with only one day of information, I could determine whether a device was used by a human or a machine.
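For anyone curious how a sample size like that falls out, the standard formula is n = z²·p(1−p)/e², optionally adjusted with a finite population correction. A quick sketch in Python; the z value 2.576 corresponds to 99% confidence, and p = 0.5 is the conservative default when the true proportion is unknown.

```python
import math

def sample_size(z: float, margin: float,
                population: int = None, p: float = 0.5) -> int:
    """Sample size for estimating a proportion: n = z^2 * p(1-p) / e^2,
    with an optional finite population correction."""
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)  # finite population correction
    return math.ceil(n)

# 99% confidence, +/- 3% margin, infinite population
print(sample_size(2.576, 0.03))  # 1844
```

With the finite population correction applied to the roughly 10,000 machines in the data set, the required sample shrinks further, which is what makes a sample like this practical.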

However, when I started looking at the data, there were a lot of NA fields in the observations.  It is pretty standard practice to remove observations with lots of NAs, or set them to some null value (like ‘0’), but in this case the absence of information was in fact information.

Consider the figure below, where I ordered (by machine data) the percentage of rows that had NA for a feature.

The top 5 features include only 2 usage features, “arpu_grp_eea” and “data_dl_dy_avg_14d”; the rest are either identity related (“id”, “type”, “imeitac_last_eea”) or date related (“date_1d”).

There is a clear break starting right at “data_dl_dy_avg_14d”.  For this feature, we see that humans still have data (around 86%) but only 8% of machines have valid information.  And it gets worse going down from there.  With a little application insight, I can tell you that the remaining features depend on data being downloaded to a device, so the fact that they are less present when a device does not download data is not a big surprise.  In fact, you could even conclude that IF a device is downloading data, THEN there is a strong probability that a human is using the device.  So the absence of information in this case is itself information.
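To make the "absence is information" point concrete, here is a tiny Python sketch that computes the per-group NA percentage for one feature. The toy records are invented; the real analysis ran over the full feature set, but the calculation per feature is exactly this.

```python
# Toy records standing in for the device data set; None marks a missing value.
rows = [
    {"label": "human",   "data_dl_dy_avg_14d": 1.2,  "arpu_grp_eea": 4},
    {"label": "human",   "data_dl_dy_avg_14d": 0.8,  "arpu_grp_eea": 3},
    {"label": "human",   "data_dl_dy_avg_14d": None, "arpu_grp_eea": 5},
    {"label": "machine", "data_dl_dy_avg_14d": None, "arpu_grp_eea": 1},
    {"label": "machine", "data_dl_dy_avg_14d": None, "arpu_grp_eea": 2},
]

def na_percent(rows: list, label: str, feature: str) -> float:
    """Percentage of rows in the given label group where `feature` is NA."""
    group = [r for r in rows if r["label"] == label]
    missing = sum(1 for r in group if r[feature] is None)
    return 100.0 * missing / len(group)

print(na_percent(rows, "machine", "data_dl_dy_avg_14d"))  # 100.0
```

Sorting features by the machine-side NA percentage is what produced the ordered figure above.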

Next, I examined the feature “arpu_grp_eea”, which was present in most machines and humans. Consider the histogram below of human and machine “arpu_grp_eea” levels.

The humans tend to have higher levels, and machines lower levels.

Based on only these observations, I derived a simple (or simplistic!) classifier rule:

Machines are devices with:

  • ARPU level ≤ 2 &
  • no data download activity.

I took this rule out for a test drive on a larger version of the data set (194,602 records), which is pretty unbalanced (5.1% machines, 94.9% humans). In this case, prediction failure (as a percentage) is not evenly weighted; a 1% misprediction of humans as machines will misclassify 1,847 devices, whereas a 1% failure rate of machines as humans will misclassify 99 devices. A good classifier should have errors with similar effect, not in ratios but in absolute numbers.
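The rule itself fits in a couple of lines, and tallying its confusion matrix cells is straightforward. This Python sketch uses invented toy devices, not the real data, so the counts are purely illustrative.

```python
def is_machine(arpu_level: int, data_dl_total: float) -> bool:
    """The simple rule from above: ARPU level <= 2 AND no data download."""
    return arpu_level <= 2 and data_dl_total == 0

# Toy devices: (true label, ARPU level, total data downloaded).
# The third machine is a Humanoid, the third human is a Cyborg.
devices = [
    ("machine", 1, 0), ("machine", 2, 0), ("machine", 3, 50),
    ("human", 5, 900), ("human", 4, 120), ("human", 1, 0),
]

# Tally the confusion matrix cells ("machine" is the positive class).
tp = fp = fn = tn = 0
for label, arpu, dl in devices:
    pred = is_machine(arpu, dl)
    if label == "machine":
        tp, fn = tp + pred, fn + (not pred)
    else:
        fp, tn = fp + pred, tn + (not pred)

print(tp, fp, fn, tn)  # 2 1 1 2
```

Notice that the two errors are exactly a Humanoid (machine predicted human) and a Cyborg (human predicted machine), which is where the real classifier's error came from too.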

And here are the results:

You can also see the confusion matrix and the 4-fold graph here.

Two other colleagues approached this problem in a different way, with a different perspective on the data set.  They got more accurate results, but they also chose a different perspective on the problem. Please take a look at Marc-Olivier’s and Pascal’s posts on how they approached this.

That’s a good idea, but…

Challenges to innovation acceptance within an organization 

I have worked within a large international technology company for many years and collaborated with other folks, within and outside the company, on product innovation.  While there are many differences in products and cultures (corporate and societal), most of the people I have met agree that internal innovation is a hard sell.

When I started out championing internal innovation 20 years ago, I was naïve enough to feel that innovation would be welcome.  Sure, some people in the company would have their noses out of joint (probably those who did not think of the idea, or whose products would be affected by the changes), but management and far-sighted people would see the wisdom and support these projects.  And for all projects there *was* support from far-sighted managers to get the innovation off the ground (sort of like internal angel investors), but rather than receiving roses at the end of the project it was always rocks and sticks.  I can be pretty stubborn by nature (reveal: my sport is long distance running), but after a few years the rejection was getting me down.

So I examined what experts had to say about innovation, to better understand the situation I was in.

Here is an insight on the problem from an old management consultant:

“It must be remembered that there is nothing more difficult to plan, more doubtful of success, nor more dangerous to management than the creation of a new system. For the initiator has the enmity of all who would profit by the preservation of the old institution and merely lukewarm defenders in those who gain by the new ones.” 

The Prince, Niccolò Machiavelli 

And here are some other words of advice from the man who is said to have invented modern management:

“It is not size that is an impediment to entrepreneurship and innovation; it is the existing operation itself, and especially the existing successful operation.” 

“Operating anything …. requires constant effort and unremitting attention.  The only thing that can be guaranteed in any kind of operation is the daily crisis. The daily crisis cannot be postponed, it has to be dealt with right away.  And the existing operation demands high priority and deserves it.  

The new always looks so small, so puny, so unpromising next to the size and performance of maturity. … The odds are heavily stacked against it succeeding.” 

Innovation and Entrepreneurship, Peter Drucker

Clearly I should have been expecting the bricks all the time.

But after some further reflection, there are some specific reasons why internal innovation in large companies is not welcome.

Negates investments in physical & personal capital

Silicon Valley’s hippy values ‘killing music industry’,

Paul McGuinness, U2 manager (January 2008)

Experts like to stay experts, since the perks that come with the job are quite good (salary, prestige). Someone coming along with a new way to do something that removes the need for that expertise is not welcome.  The same is true for product managers: they don’t want to hear that their cash cow is dead because of this new innovation, or that the new servers we bought are now obsolete. Shooting the messenger is de rigueur in this situation.

Upsets the existing hierarchy and power base

“When Henry Ford made cheap, reliable cars people said, ‘Nah, what’s wrong with a horse?’ That was a huge bet he made, and it worked.”

Elon Musk. 

Mainframes vs PCs; CD/DVDs vs streaming; electric cars vs oil cars.  There are going to be winners and losers, and innovation is the catalyst for change.

It is a drain on resources

A company always has a crisis going on (usually several), and it needs all the good, talented people to help resolve the issues.  The last thing a manager handling the crisis wants to hear is that there is a skunkworks project sucking up the people and resources they could use to solve the crisis.   And even if the project is “ring-fenced” to prevent poaching, there will be constant comments in management meetings that the company does not have the right “focus” on the crisis, since “our best people are not engaged”.

Is often an instrument of change

“People hate change because people hate change”

Tom DeMarco, Peopleware

It is sort of in the nature of innovation to introduce change.  And instinctively we resist change, for good and bad reasons.  Even changes that bring longer-term benefit to most people have their resisters, even long after the change has been accomplished.  As of 2016, 13% of Americans did not use the internet, a level that had not changed for 3 years.   And now vinyl records are making a comeback, which will please the folks that never embraced CDs.

And besides these reasons, all the innovator typically has to offer is an idea, and possibly a prototype, with limited ability to make money for the organization in the next quarter.  Sort of like throwing a baby lamb to the sharks, when you think about it.

But all is not lost.  In subsequent posts I will talk about some strategies to overcome the resistance and get an innovation to market within a large company.  Just don’t expect any flowers along the way.