In the previous blog post (link) I talked about how to convert location data into place information using density-based clustering of the journey start and end locations from a set of vehicles over a two-month period. In this post I am going to explore what we can conclude from that place information.
Being the good data scientists that we all are, the natural thing to reach for when examining some data for the first time is usually a histogram. For both numeric and categorical data, a histogram usually gives some insight into the spread of the data and its density or frequency with respect to range bins or categories. In the image below, you can see histograms of the place clusters for 3 cars, with the y-axis representing the percentage of the journeys that had either a start or a stop in a place cluster.
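To make the y-axis concrete, here is a minimal sketch of how such percentages could be computed for one car. It assumes each journey has already been reduced to a pair of place cluster labels (start, end), and that starts and stops are pooled into one count. The function name and the toy journeys are illustrative only and are not taken from the real data set.

```python
from collections import Counter

def cluster_percentages(journeys):
    """Percentage of journey endpoints (starts and stops pooled) per place cluster.

    journeys: list of (start_cluster, end_cluster) labels for a single car.
    Pooling starts and stops is an assumption about how the histogram is built.
    """
    endpoints = [label for journey in journeys for label in journey]
    total = len(endpoints)
    if total == 0:
        return {}
    counts = Counter(endpoints)
    return {label: 100.0 * n / total for label, n in sorted(counts.items())}

# Toy journeys for a hypothetical car (label 0 = not in any cluster).
journeys = [(1, 2), (2, 1), (1, 0), (0, 1), (1, 2)]
pct = cluster_percentages(journeys)
for label, value in pct.items():
    print(f"cluster {label}: {value:.0f}%")
```

The resulting dictionary can be handed straight to a bar chart, one bar per cluster label.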
For these cars, we can get some interesting insights from these histograms.
Highest frequency clusters
In all the histograms (and in fact for the majority of the cars I studied) there was one place cluster that had the largest share of the starts and stops. For all the cars shown above this is cluster 1. Now if this were my personal car, I would call this place “home”, where I live. For service vehicles (like delivery vans) this may be the store or factory. For rental cars, this could be the rental car agency location. Whatever the situation, it is the place where the car ends up most often. There are other peaks as well, like cluster 2 for car 1. If this were a residential car and I think about my own behaviour, I would call this place “work”. Of course this analysis is speculative unless there is some additional data we can bring in, which I will talk about later on.
Recall from the previous post that place cluster “0” represents the locations that were not associated with any other cluster and could not form a cluster of their own. Any car that has a lot of cluster “0” journey starts and stops does not spend a lot of time going to the same places. Such cars are not predictable, at least with respect to their journeys.
Compare car 1 and car 2. Car 1 has ~8% of its journeys starting or stopping in non-clustered areas. Car 2 has ~23% outside of clusters. Compared with car 1, car 2 seems less predictable in its driving habits.
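The comparison above boils down to one number per car: the share of journey endpoints carrying label 0. A small sketch of that calculation follows, assuming each car's starts and stops are available as a flat list of cluster labels. The car names and label lists are made up for illustration and do not reproduce the 8% and 23% figures.

```python
def unclustered_share(endpoint_labels, noise_label=0):
    """Fraction of journey endpoints that fall outside every place cluster."""
    if not endpoint_labels:
        return 0.0
    outside = sum(1 for label in endpoint_labels if label == noise_label)
    return outside / len(endpoint_labels)

# Hypothetical cars with made-up endpoint labels (0 = not in any cluster).
cars = {
    "car_a": [1, 1, 2, 0, 1, 2, 1, 1, 2, 1],
    "car_b": [0, 1, 0, 2, 0, 1, 0, 3, 1, 0],
}

# Order cars from most to least predictable by this one measure.
ranked = sorted(cars, key=lambda name: unclustered_share(cars[name]))
for name in ranked:
    print(name, f"{unclustered_share(cars[name]):.0%}")
```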
Below are two rather pathological examples – cars 19 and 20.
Both of these cars have only 1 cluster (cluster 1), and many or most (in the case of car 20) of their journey starts and stops are in unique locations, or at least in locations more than 550 meters away from each other. Without any further insight I would venture they are taxis or possibly rental cars.
Number of clusters
Cars with more clusters go to more places than cars with fewer clusters, which may seem a rather obvious statement. But another way to consider that information is that a car with fewer clusters is probably more predictable in its journey origins and destinations, since it goes to fewer places. Consider car 1 and car 3 from figure 1. Approximately 75% of car 1’s journeys start or end in cluster 1 or 2. For car 3, the two most common clusters are 1 and 9, but the associated share of starts and ends is only ~48%. If I was looking to find car 1 at any point in time, and without any more information, I would likely find it in place cluster 1 or 2.
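The "top two clusters" figure can be computed directly from the endpoint labels. The sketch below assumes that label 0 is excluded when ranking clusters but kept in the denominator, so that the result reads as a share of all starts and stops. That choice, the function name, and the toy labels are my assumptions for illustration.

```python
from collections import Counter

def top_k_coverage(endpoint_labels, k=2, noise_label=0):
    """Return the k most common place clusters and the share of all endpoints they cover."""
    if not endpoint_labels:
        return [], 0.0
    # Rank real clusters only; unclustered endpoints still count in the total.
    counts = Counter(label for label in endpoint_labels if label != noise_label)
    top = counts.most_common(k)
    covered = sum(n for _, n in top)
    return [label for label, _ in top], covered / len(endpoint_labels)

# Made-up endpoint labels for a hypothetical car.
labels = [1, 1, 1, 2, 2, 0, 3, 1, 2, 1]
top_clusters, coverage = top_k_coverage(labels)
print(top_clusters, f"{coverage:.0%}")
```

A high coverage with a small k suggests a car that is easy to locate; a low one suggests its journeys are spread over many places.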
Another interesting thing to consider is how the number of clusters relates to the size of cluster “0”. Consider car 2 and car 3 in figure 1; car 2 has more starts and stops outside of common locations than car 3, but car 3 has more clusters (13 versus 9). One could conclude that car 3 “usually” goes to many of the same places, whereas car 2 goes to many different places and has fewer regular ones. I have no ground truth here, but car 2 seems more like a taxi whereas car 3 looks more like a delivery van going on a regular delivery route.
Adding the geographical context
If we add geographical information to the histogram information, we can start to make more inferences about a car’s journey patterns. By looking at the actual places using a mapping service (like Google Maps) and street-level imagery (like Google Street View), additional information can be obtained to enrich the place histogram analysis.
Consider the mapping of car 23’s places onto Google Maps below.
For privacy reasons I have removed the actual locations of place 1 and place 2, but I can say that place 1 is a residential location and place 2 is an industrial location. Given the nature of the locations, the fact that place cluster 1 has more starts and stops than cluster 2, and the fact that the remaining clusters are relatively small, it seems reasonable to call place cluster 1 “home” and cluster 2 “work”.
Powerful insights indeed, but somewhat limited by the human analysis effort involved (i.e. considering the relative cluster sizes, looking up the address in Google, going into Street View, etc.). For 1 or 2 or even 20 cars this is manageable, but as part of an automated process this approach is a bit lacking. I did not explore whether the location enrichment I did with Google could be automated via APIs, nor did I try to write rules for the histogram analysis, so both are possibly something to follow up on in a future project.
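As a starting point for that follow-up, here is a rough sketch of what rule-based histogram analysis might look like. This is not something I ran on the data. The 50% threshold for flagging a car as "irregular" is an arbitrary assumption, and the neutral names "primary" and "secondary" are used on purpose, since calling a place "home" or "work" still needs the geographical context.

```python
def describe_car(pct_by_cluster, noise_label=0, irregular_threshold=50.0):
    """Summarise one car's place histogram with a few simple, hypothetical rules.

    pct_by_cluster: {cluster_label: percent of journey starts and stops}.
    irregular_threshold: assumed cut-off (in percent) on the unclustered share.
    """
    noise_pct = pct_by_cluster.get(noise_label, 0.0)
    # Rank the real clusters from largest to smallest share.
    ranked = sorted(
        ((label, pct) for label, pct in pct_by_cluster.items() if label != noise_label),
        key=lambda item: item[1],
        reverse=True,
    )
    summary = {
        "unclustered_pct": noise_pct,
        "n_clusters": len(ranked),
        "irregular": noise_pct >= irregular_threshold,
    }
    if ranked:
        summary["primary"] = ranked[0][0]
    if len(ranked) > 1:
        summary["secondary"] = ranked[1][0]
    return summary

# Made-up histogram for a hypothetical car.
summary = describe_car({0: 8.0, 1: 50.0, 2: 25.0, 3: 17.0})
print(summary)
```

Rules like these would only triage cars for closer inspection; they say nothing about what a place actually is.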
In the next post I will go beyond this visualization approach and consider numeric approaches to describe, and (hopefully) make some probability statements about, journey destinations and origins. The questions I want to get to are:
“If a car is in a particular place at a particular time and day, what is the most likely place it is going to next and when?”
“Is it unusual to have a particular car in this place at this time and day?”