February 2018 – Technology Beyond the Hype

In the previous blog post (link) I talked about how to convert location data into place information, using a density-based clustering of the journey start and end locations from a set of cars/vehicles over a two-month period. In this post I am going to explore what we can conclude from that place information.

Being the good data scientists that we all are, the natural thing to reach for when examining some data for the first time is usually a histogram. For both numeric and categorical data, a histogram usually gives some insight into the spread of the data and its density or frequency with respect to range bins or categories. In the image below, you can see histograms of the place clusters for 3 cars, with the y-axis representing the percentage of the journeys that either had a stop or start in a place cluster.

figure 1 – sample 3 car place frequency distribution
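As a rough illustration of how such a histogram can be tabulated (this is a Python sketch, not the code behind the figure), the snippet below counts how often each place cluster appears among a car's journey starts and stops and converts the counts to percentages. The function name `place_frequencies` and the sample data are made up for the example, and it assumes the percentages are taken over all journey endpoints.

```python
from collections import Counter

def place_frequencies(endpoint_clusters):
    """Return {cluster_id: percentage of journey endpoints}, ordered by cluster id."""
    counts = Counter(endpoint_clusters)
    total = sum(counts.values())
    return {c: 100.0 * n / total for c, n in sorted(counts.items())}

# Hypothetical data: the cluster id of every journey start and stop for one car
# (0 means the location did not fall in any cluster).
endpoints = [1, 1, 1, 1, 2, 2, 2, 0, 3, 1]

freqs = place_frequencies(endpoints)
for cluster, pct in freqs.items():
    # One '#' per 5 percentage points gives a crude text histogram.
    print(f"cluster {cluster:>2}: {pct:5.1f}% {'#' * round(pct / 5)}")
```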

For these cars, we can get some interesting insights from these histograms.

Highest frequency clusters

In all the histograms (and in fact for the majority of the cars I studied) there was one place cluster that accounted for the largest share of the starts and stops. For all the cars shown above this is cluster 1. Now if this was my personal car, I would call this place “home”, where I live. For service vehicles (like delivery vans) this may be the store or factory. For rental cars, this could be the rental car agency location. Whatever the situation, it is the place where the car ends up most of the time. There are other peaks as well, like cluster 2 for car 1. If this was a residential car and I think about my own behaviour, I would call this place “work”. Of course this analysis is speculative unless there is some additional data we can bring in, which I will talk about later on.

Cluster “0”

Recall from the previous post that place cluster “0” represents the locations that were not associated with any other cluster and could not form a cluster themselves. Any car that has a lot of cluster “0” journey starts and stops does not spend a lot of time going to the same places. Such cars are not predictable, at least with respect to their journeys.

Compare car 1 and car 2. Car 1 has ~8% of its journeys starting or stopping in non-clustered areas. Car 2 has ~23% outside of clusters. Compared with car 1, car 2 seems less predictable in its driving habits.

Below are two rather pathological examples – cars 19 and 20.

Both of these cars have only 1 cluster (cluster 1), and many or most (in the case of car 20) of their journey starts and stops are in unique locations, or at least in locations more than 550 meters away from each other. Without any further insights I would venture they are taxis or possibly rented cars.

Number of clusters

Cars with more clusters go to more places than cars with fewer clusters, which may seem a rather obvious statement. But another way to consider that information is that a car with fewer clusters is probably more predictable in its journey origins and destinations, since it goes to fewer places. Consider car 1 and car 3 from figure 1. Approximately 75% of car 1’s journeys start or end in cluster 1 or 2. For car 3, the two most common clusters are 1 and 9, but together they account for only ~48% of the starts and ends. If I was looking to find car 1 at any point in time, and without any more information, I would likely find it in place cluster 1 or 2.
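One way to put a number on this comparison is the combined share of a car's two busiest clusters. The Python sketch below is only an illustration: `top_k_share` is a made-up helper, and the percentages for `car_a` are invented values chosen to resemble the car 1 description, not the real data.

```python
def top_k_share(freqs, k=2, noise_label=0):
    """Return the k most frequent clusters (ignoring noise) and their combined share."""
    clustered = {c: p for c, p in freqs.items() if c != noise_label}
    top = sorted(clustered.items(), key=lambda item: item[1], reverse=True)[:k]
    return [c for c, _ in top], sum(p for _, p in top)

# Hypothetical percentages of journey starts/stops per place cluster.
car_a = {0: 8.0, 1: 50.0, 2: 25.0, 3: 10.0, 4: 7.0}

clusters, share = top_k_share(car_a)
print(f"top clusters {clusters} cover {share:.0f}% of starts and stops")
```

A higher share for a small `k` suggests a car whose whereabouts are easier to guess; cluster 0 is left out because it is not a single place.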

Another interesting thing to consider is how the number of clusters relates to the size of cluster “0”. Consider car 2 and car 3 in figure 1; car 2 has more starts and stops outside of common locations than car 3, but car 3 has more clusters (13 versus 9). One could conclude that car 3 “usually” goes to many of the same places, whereas car 2 goes to many different places and has fewer usual ones. I have no ground truth here, but car 2 seems more like a taxi whereas car 3 looks more like a delivery van going on a regular delivery route.

Adding the geographical context

If we add geographical information to the histogram information, we can start to make more inferences about a car’s patterns of journeys. By considering the actual places using a mapping service (like Google Maps) and street visualization (like Google Street View), additional information can be obtained to enrich the place histogram analysis.

Consider the mapping of car 23’s places onto Google Maps below.

figure 3 – mapping place frequency to location analysis

For privacy reasons I have removed the actual location of place 1 and place 2, but I can say that place 1 is a residential location and place 2 is an industrial location. Given the nature of the locations and the fact that place cluster 1 has more starts/stops than cluster 2, and the remaining clusters are relatively small, it seems reasonable to call place cluster 1 “home” and cluster 2 “work”.

Powerful insights indeed, but somewhat limited by human analysis effort (i.e. considering the relative cluster sizes, looking at the address in Google, going into Street View, etc.). For 1 or 2 or even 20 cars this is manageable, but as part of an automated process this approach is a bit lacking. I did not explore whether the location enrichment I did with Google can be automated via APIs, nor did I try to make rules for the histogram analysis, so that is possibly something to follow up on in a future project.

In the next post I will go beyond this visualization approach to consider numeric approaches to describe and (hopefully) make some probability statements on journey destinations and origins. The questions I want to get to are:

“If a car is in a particular place at a particular time and day, what is the most likely place it is going to next and when?”

“Is it unusual to have a particular car in this place at this time and day?”

On most projects where I work with data sets, I have location information, either directly reported (like GPS coordinates) or implied (as with phone users’ connections to mobile cell sites). How to incorporate this location data into a model varies a lot with the sort of model we are trying to develop, but I want to share one example to show the sort of insights that can be achieved and how to get them.

In one project we had access to location information from a set of cars. The data set was relatively limited; we only had 1.5 to 2 months of data from each car. On the other hand, the data was rather voluminous; the car data was reported every second, including the car’s current location. The data was only reported when the car was turned on (sort of obvious, but worth mentioning), so we could look at the data as being associated with the movement of a car over these 1.5 to 2 months.

The approach I took to wrangle this data was to consider journeys between places, but not the path taken between places. The rationale was that I did not really care which route the driver took to reach a location; I was more interested in knowing where they were going and coming from. In my personal experience I can take different paths on different days to avoid traffic, but I usually end up at the same place, like work, home, shopping for groceries, etc. I was interested to know where the cars were going, and whether I could derive some insights about these habitual places.

Step 1: Making Journeys

While in some cases a journey of a thousand miles can begin with a single step, for me they began with creating some internal rules about what constituted a journey. Given a set of time-gapped location information, I applied the following rule:

A journey is a run of data records bounded by gaps of more than 10 minutes in the data, with a travel distance > 500 m based on the odometer change.

The logic behind this rule was to avoid counting cars that redeploy over a short distance (like moving from one side of the street to another to avoid a ticket) or make a short stop in the middle of a journey (like dropping someone off on the way home or stopping to fill up on gas). On the other hand, I fully admit the thresholds do seem a little arbitrary. Why not 15 minutes? Or 5 minutes? Or greater than 300 meters? Yup, all good points, but I think at the end of the day we are just quibbling about the parameters, not the approach. In a perfect world I would probably try a few different parameter ranges to see what sort of convergence would work with the data set. In the rather imperfect world I live in, I had a few days to complete the analysis, so I just went with my rule.

At this point then it becomes rather mechanical to create the journeys. As you can see in this R code snippet, the process is:

1) Calculate the time difference between each observation and the previous one, and store it with the more recent observation (i.e. difference_i = time_i - time_(i-1)).
2) Iterate through the location data, assigning a journey number (starting at 1) until a > 10-minute gap is detected, in which case increment the journey number and assign it to the next point (since it is the start of another journey).
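The two steps above can be sketched in Python as follows. This is an illustration rather than the original R snippet: the record layout (a timestamp plus an odometer reading in meters), the function names and the sample records are all assumptions, and so is the way the 500 m condition is applied (segments that cover too little distance are simply dropped).

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=10)  # gap that separates two journeys (from the rule above)
MIN_DISTANCE_M = 500         # minimum odometer change for a journey (from the rule above)

def assign_journeys(records, gap=GAP):
    """records: (timestamp, odometer_m) tuples sorted by time.
    Returns one journey number per record, starting at 1."""
    numbers = []
    journey = 1
    previous_time = None
    for timestamp, _ in records:
        # A gap longer than `gap` means this record starts a new journey.
        if previous_time is not None and timestamp - previous_time > gap:
            journey += 1
        numbers.append(journey)
        previous_time = timestamp
    return numbers

def journeys_over_min_distance(records, numbers, min_distance_m=MIN_DISTANCE_M):
    """Keep journey numbers whose odometer change exceeds the minimum distance."""
    first, last = {}, {}
    for (_, odometer), n in zip(records, numbers):
        first.setdefault(n, odometer)  # odometer at the first record of the journey
        last[n] = odometer             # odometer at the latest record seen so far
    return [n for n in first if last[n] - first[n] > min_distance_m]

# Hypothetical records: three bursts of data separated by long gaps.
t0 = datetime(2018, 1, 1, 8, 0, 0)
records = [
    (t0, 0), (t0 + timedelta(seconds=1), 10), (t0 + timedelta(seconds=2), 900),
    (t0 + timedelta(minutes=20), 900), (t0 + timedelta(minutes=20, seconds=1), 950),
    (t0 + timedelta(minutes=60), 950), (t0 + timedelta(minutes=60, seconds=1), 2000),
]

numbers = assign_journeys(records)
kept = journeys_over_min_distance(records, numbers)
print(numbers, kept)
```

In this made-up data the second burst only covers 50 m, so it is treated as a redeployment and not as a journey.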

Step 2: Finding places

Now with a list of journeys, the next step is to find the places. The assumption here is that many of the journeys were to and from similar places, even if the absolute location is different. For example, when I go to work, I park in a different parking spot depending on when I arrive, and the spots could be 100-200 meters apart, but fundamentally I am still at “Work”. And when I go home and have to park on the street, I might end up parking in a different location close to my house, but I am still at “Home”.

The approach I took was to consider a density-based clustering of journey start and end locations, to find common places in the journeys. I used the DBSCAN (link) algorithm and had to provide values for the minimum number of locations in a place cluster, as well as an epsilon (i.e. the maximum distance between locations) for these locations to be considered part of the same cluster. My values were 3 points in a cluster, and location points within approximately 550 meters. This value of epsilon generated what (visually) seemed to be a reasonable set of places.
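The 550 meter figure can be sanity-checked with a little arithmetic. The Python sketch below assumes the epsilon is expressed in degrees of latitude/longitude (the post quotes 0.005 further down), a spherical Earth with a mean radius of 6,371 km, and a hypothetical latitude of 45 degrees, since the actual locations are not given.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, a common spherical approximation

def degrees_to_metres(eps_deg, latitude_deg):
    """Approximate ground distance covered by eps_deg degrees.
    Returns (north_south_m, east_west_m) at the given latitude."""
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180
    north_south = eps_deg * metres_per_degree
    # Meridians converge towards the poles, so a degree of longitude shrinks
    # by the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

# 45.0 is a hypothetical latitude used only for illustration.
ns, ew = degrees_to_metres(0.005, 45.0)
print(f"0.005 degrees is about {ns:.0f} m north-south and {ew:.0f} m east-west")
```

One degree of latitude is about 111 km, so 0.005 degrees comes to roughly 556 m north-south. The east-west distance is shorter away from the equator, which means a fixed epsilon in degrees is not a perfectly circular neighbourhood on the ground.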

Here is the R code I used:

1) Here the latitude and longitude (lat and lon) of all the journey start and stop locations are formed into a data frame, and the duplicates (if any) are removed. At the end of the process for this car, there were 293 unique locations.

2) Here is the call to DBSCAN (the comment in the code is in error: an epsilon of 0.005 for lat/lon is approximately 550-555 meters (link)).

3) Here you can see the cluster results. Cluster 0 holds the points that were not clustered, and clusters 1 through 15 represent valid clusters or in my case … places. For example, cluster 1 has 162 locations that are considered to be in the same place, whose position we would calculate as the mean value of their latitude and longitude.
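For readers without the R snippets to hand, the three steps can be sketched in plain Python. This is a textbook-style DBSCAN written for illustration, not the author's R call: it assumes Euclidean distance on raw latitude/longitude degrees (matching an epsilon of 0.005), a minimum of 3 points that counts the point itself, and label 0 for unclustered points. The sample coordinates are invented.

```python
import math

NOISE = 0  # the post uses cluster 0 for points that belong to no cluster

def dbscan(points, eps, min_pts):
    """Label each (lat, lon) point with a cluster id (1, 2, ...) or NOISE."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_neighbours)
    return labels

def place_centres(points, labels):
    """Mean latitude and longitude of every cluster (noise is skipped)."""
    sums = {}
    for (lat, lon), label in zip(points, labels):
        if label == NOISE:
            continue
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += lat
        s[1] += lon
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

# Hypothetical journey start/stop locations, including one duplicate.
locations = [(45.000, 7.000), (45.001, 7.001), (45.002, 7.000), (45.001, 7.001),
             (45.100, 7.100), (45.101, 7.100), (45.100, 7.101), (46.000, 8.000)]
unique = list(dict.fromkeys(locations))        # step 1: drop duplicates, keep order
labels = dbscan(unique, eps=0.005, min_pts=3)  # step 2: cluster into places
centres = place_centres(unique, labels)        # step 3: one mean position per place
print(labels, centres)
```

The neighbour search here compares every pair of points, which is fine for a few hundred locations per car but would need a spatial index for much larger sets.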

For a visual interpretation, you can consider this image where the locations are plotted on a map, and the locations are superimposed as coloured circles (colour coded for cluster).

Now we have the places that this car goes, not just its locations. In a future post, I will talk about how to interpret these places, and try to find ways to detect predictability in the car’s journeys.
