Most of the data sets I work with on projects include location information, either directly reported, like GPS coordinates, or implied, as with phone users' connections to mobile cell sites. How to incorporate this location data into a model varies a lot depending on the sort of model we are trying to develop, but I want to share one example to show the sort of insights that can be achieved and how to achieve them.
In one project we had access to location information from a set of cars. The data set was relatively limited; we only had 1.5-2 months of data from each car. On the other hand, the data was rather voluminous; each car reported data every second, including its current location. The data was only reported when the car was turned on (sort of obvious, but worth mentioning), so we could look at the data as being associated with the movement of a car over these 1.5-2 months.
The approach I took to wrangle this data was to consider journeys between places, but not the path taken between places. The rationale was that I did not really care which route the driver took to reach a location; I was more interested in knowing where they were going and coming from. In my personal experience, I can take different paths on different days to avoid traffic, but I usually end up at the same places, like work, home, or the grocery store. I wanted to know where the cars were going, and whether I could derive some insights about these habitual places.
Step 1: Making Journeys
While in some cases a journey of a thousand miles can begin with a single step, for me journeys began with creating some internal rules about what constituted one. Given a set of time-stamped location records, I applied the following rule:
- A journey boundary is a gap of more than 10 minutes in the data records, combined with a travel distance > 500 m based on the odometer change.
The logic behind this rule was to avoid counting cars that redeploy over a short distance (like moving from one side of the street to the other to avoid a ticket) or that make a short stop in the middle of a journey (like dropping someone off on the way home or stopping for gas). On the other hand, I fully admit these thresholds do seem a little arbitrary. Why not 15 minutes? Or 5 minutes? Or greater than 300 meters? Yup, all good points, but I think at the end of the day we are just quibbling about the parameters, not the approach. In a perfect world I would probably try a few different parameter ranges to see what sort of convergence would work with the data set. In the rather imperfect world I live in, I had a few days to complete the analysis, so I just went with my rule.
At this point it becomes rather mechanical to create the journeys. As you can see in the R code snippet, the process is:
- Calculate the time difference between each observation and the previous one, and store it with the most recent observation (i.e. difference[i] = time[i] - time[i-1]).
- Iterate through the location data, assigning a journey number (starting at 1) until a > 10-minute gap is detected, in which case increment the journey number and assign it to the next point (since it is the start of another journey).
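Since the original R snippet is not reproduced here, the two steps above can be sketched in Python; the data layout (a sorted list of observation times) is my assumption, not the post's actual schema:

```python
from datetime import datetime, timedelta

def assign_journeys(times, gap=timedelta(minutes=10)):
    """Label each observation with a journey number, starting at 1.

    `times` is assumed to be sorted ascending.  A new journey starts
    whenever the time difference to the previous observation exceeds
    `gap`.  (The > 500 m odometer check would be applied afterwards, to
    drop "journeys" that barely moved.)
    """
    journey_ids = []
    current = 1
    previous = None
    for t in times:
        # difference[i] = time[i] - time[i-1]; a large gap starts a new journey
        if previous is not None and (t - previous) > gap:
            current += 1
        journey_ids.append(current)
        previous = t
    return journey_ids

times = [datetime(2024, 1, 1, 8, 0, 0),
         datetime(2024, 1, 1, 8, 0, 1),
         datetime(2024, 1, 1, 8, 30, 0),  # > 10-minute gap: a new journey
         datetime(2024, 1, 1, 8, 30, 1)]
print(assign_journeys(times))  # [1, 1, 2, 2]
```

The same single pass works on a data frame of per-second records; only the gap comparison and the running counter matter.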
Step 2: Finding Places
Now, with a list of journeys, the next step is to find the places. The assumption here is that many of the journeys were to and from similar places, even if the absolute locations differ. For example, when I go to work, I park in a different parking spot depending on when I arrive, and it could be 100-200 meters away, but fundamentally I am still at “Work”. And when I go home and have to park on the street, I might end up parking in a different location close to my house, but I am still at “Home”.
The approach I took was a density-based clustering of journey start and end locations, to find common places in the journeys. I used the DBSCAN (link) algorithm, which requires values for the minimum number of locations in a place cluster, as well as an epsilon (i.e. a distance measure between locations) within which locations are considered part of the same cluster. My values were 3 points in a cluster, and location points within approximately 550 meters. This value of epsilon generated what (visually) seemed to be a reasonable set of places.
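As a sanity check on that epsilon: one degree of latitude spans roughly 111.32 km anywhere on Earth, so 0.005 degrees works out to about 556 metres (a degree of longitude shrinks away from the equator, which is one reason the 550 m figure is only approximate):

```python
# One degree of latitude spans roughly 111.32 km anywhere on Earth.
metres_per_degree_lat = 111_320  # approximate
eps_degrees = 0.005
eps_metres = eps_degrees * metres_per_degree_lat
print(round(eps_metres, 1))  # 556.6
```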
Here is the R code I used:
1) Here the latitude and longitude (lat and lon) of all the journeys' start and stop locations are formed into a data frame, and the duplicates (if any) removed. At the end of this process for this car, there were 293 unique locations.
2) Here is the call to DBSCAN (note the error in the code comment: an epsilon of 0.005 in lat/lon degrees is approximately 550-555 meters (link)).
3) Here you can see the cluster results. Cluster 0 holds the points that were not clustered, and clusters 1 through 15 represent valid clusters, or in my case … places. For example, cluster 1 has 162 locations that are considered to be in the same place – whose position we would calculate as the mean value of latitude and longitude.
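Since the R code itself is not shown here, the clustering step can be sketched in Python with scikit-learn's DBSCAN. The coordinates below are made up for illustration, and note one convention difference: scikit-learn labels noise as -1, where R's dbscan package uses cluster 0.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical journey endpoints (lat, lon) standing in for the ~293
# unique start/stop locations from the post.
points = np.array([
    [51.5010, -0.1250],   # three points near "Work"
    [51.5012, -0.1248],
    [51.5008, -0.1252],
    [51.5200, -0.1000],   # three points near "Home"
    [51.5202, -0.1002],
    [51.5198, -0.0998],
    [51.6000,  0.0000],   # an isolated stop -> noise
])

# eps = 0.005 degrees (~550 m) and minPts = 3, matching the post's parameters.
labels = DBSCAN(eps=0.005, min_samples=3).fit_predict(points)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]; -1 marks unclustered noise

# A "place" is then the mean lat/lon of each cluster's points.
for cluster_id in set(labels) - {-1}:
    centre = points[labels == cluster_id].mean(axis=0)
    print(cluster_id, centre)
```

Euclidean distance on raw lat/lon degrees is what a 0.005 epsilon implies; for clusters only a few hundred metres wide that distortion is small, but a haversine metric would be the stricter choice.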
For a visual interpretation, consider this image, where the locations are plotted on a map and the clusters are superimposed as coloured circles (colour-coded by cluster).
Now we have the places that this car goes, not just its locations. In a future post, I will talk about how to interpret these places, and try to find ways to detect predictability in the car's journeys.