In the previous blog post (link) I talked about how to interpret places from journey records, proposing a visualization-based way to identify key places (“home”, “work”) and then validate them using common mapping services like Google. In this post I am going to explore analytic insights from the place information, to address questions about the likelihood of journeys and their habitual nature.
Preparing the data
The data set I used for the place frequency analysis until now was basically a list of journeys with start and stop places, with time and date information. A snippet is shown below:
Next I want to understand how often during the sample period the car in question went from one place to another. The data set contains actual journeys, but what is missing is the days and times when there were no journeys. For example, if a car went from place 1 to place 2 on Monday 3 times in the data set, I don’t immediately know how habitual that is unless I know how many “Mondays” were present in the sample time-frame.
What I do know at this point is that, depending on the car, I have about 1.5 to 2 months of data (between 6 and 8 weeks), which may not be a lot when looking for insights into journey habits. My instinct at this point is to first consider coarse temporal patterns, like daily events, before moving to more granular analysis, like hourly or time-of-day patterns.
I processed the actual journeys to expand the non-travel dates and also sum the number of journeys between places on each day of the week. You can see a snippet of the file below, focusing on Monday and Friday journeys for a particular car:
What you can see is that (on row 2) on “Mondays” there were a total of 13 journeys from place 3 to place 3 that occurred over 4 days (journey.days), with 2 days (non.journey.days) when there was no travel. If you add the journey.days and non.journey.days features together you can figure out that there was a total of 6 “Mondays” in my sample set (i.e. 6 weeks of data). You can also see (on row 64) that there were no journeys on Mondays from place 1 to place 1.
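As a rough illustration of where journey.days and non.journey.days come from, here is a minimal sketch in R. The sample window and the journey dates are made up for the example (the real dates are not shown in this post); only the two feature names come from the data set above.

```r
# Hypothetical sample window: exactly 6 weeks, starting on a Monday.
sample.start <- as.Date("2024-01-01")
sample.end   <- as.Date("2024-02-11")
all.dates    <- seq(sample.start, sample.end, by = "day")

# as.POSIXlt()$wday is locale independent: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday <- as.POSIXlt(all.dates)$wday
days.in.sample <- table(factor(wday, levels = 0:6))

# Hypothetical dates on which one place pair was travelled (all Mondays).
journey.dates <- as.Date(c("2024-01-01", "2024-01-08", "2024-01-22", "2024-02-05"))

# Days with at least one journey, and the remaining Mondays with none.
journey.days     <- sum(as.POSIXlt(unique(journey.dates))$wday == 1)
non.journey.days <- as.integer(days.in.sample["1"]) - journey.days
```

With these made-up dates the window contains 6 Mondays, 4 of them with travel and 2 without, which mirrors the row 2 example above.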
After the expanded data set of journeys was built, I created two 3-dimensional matrices:
- Total number of trips per day from one place to another (journey.count.x)
- Number of days in the sample on which at least 1 trip occurred from one place to another (pred.journey.x)
The matrix structure for each is:
(journey start place — “from” X journey end place — “to” X day of the week)
(total number of places X total number of places X 7)
I have included the code snippets at the end of the blog to see the necessary transformations to make these matrices from the data frame above.
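To give a feel for the matrix structure, here is a minimal sketch (not the original script) that fills both arrays from a toy data frame. The column names from, to and day, the toy values, and the place count of 10 are assumptions made for the example.

```r
# Toy stand-in for the expanded data frame; column names are assumptions.
journeys <- data.frame(
  from          = c(3, 10, 3, 3),
  to            = c(3, 3, 10, 6),
  day           = c(1, 1, 1, 2),    # 1 = Monday, ..., 7 = Sunday
  journey.count = c(13, 5, 4, 5),   # total trips on that weekday
  journey.days  = c(4, 4, 4, 5)     # days with at least one trip
)
n.places <- 10  # assumed number of distinct places

# from X to X day-of-week arrays, initialised to zero (no travel).
journey.count.x <- array(0, dim = c(n.places, n.places, 7))
pred.journey.x  <- array(0, dim = c(n.places, n.places, 7))

# A 3-column index matrix addresses one cell per data frame row.
idx <- cbind(journeys$from, journeys$to, journeys$day)
journey.count.x[idx] <- journeys$journey.count
pred.journey.x[idx]  <- journeys$journey.days
```

Indexing with a three-column matrix avoids an explicit loop over rows, and any place pair and day that never appears in the data frame simply stays at zero.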
So now the fun can begin! Let’s consider some questions related to temporal journey patterns.
Where does the car usually go?
This is really two questions hiding inside one:
- What journeys does the car take on a regular basis?
- What journeys does the car NOT take on a regular basis?
Both of these questions relate to habitual patterns in the data, in contrast to non-habitual (i.e. unpredictable) patterns. Both habitual and non-habitual behaviour are interesting to know, since they allow us to form quantitative statements on the likelihood of journeys on certain days.
Take an example: assume a particular car has gone from place 3 to place 6 on five Tuesdays out of six. This seems to be a “usual” journey, but how usual is it?
This question can be formulated as a probabilistic hypothesis that we can use the sample data to resolve. Given we have trial information (5 out of 6 occurrences), this sounds like an application of binomial probability, but in this case we don’t know the probability (p) of traveling from place 3 to place 6. However, what we can do is calculate the probability of getting 5/6 occurrences for different values of p and identify the p values under which that outcome is “unlikely”. I have rather arbitrarily set “unlikely” as less than 10%, meaning I reject any p value under which the observed outcome has less than a 10% chance of occurring (a 90% level).
To be a bit more precise, I want to find the values of p (pi) for which I can reject the following (null) hypothesis at the 90% level:
Journeys between place 3 and place 6 on Tuesdays follow a binomial distribution with p value pi
Below are the probability values for different values of p and different numbers of journey days out of a maximum of 6.
We can see that for 5 journey days out of 6 (“5T”), a p value (“p”) of 0.50 has a probability of 0.094, and a p value of 0.55 has a probability of 0.136. If the value of pi in the above hypothesis was 0.50, I could reject the hypothesis, since there is only a 9.4% chance of seeing 5/6 journey days with that p value. In fact, I can reject the hypothesis for all values less than 0.50 as well, because their probabilities are also lower than 10%. What I am left with is the statement that the journey from place 3 to place 6 on Tuesdays is “usual”: the smallest p value in the table that I cannot reject is 0.55, so, roughly speaking, there is a 55% or better chance that the car will go from place 3 to place 6 on a Tuesday.
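The 5-out-of-6 probabilities quoted above can be reproduced with R's built-in dbinom function. This is a minimal sketch; the 0.05 step for the p values is an assumption suggested by the 0.50 and 0.55 values quoted above, and the variable names are just for illustration.

```r
# Candidate p values on an assumed grid with a 0.05 step.
p.grid <- seq(0.05, 0.95, by = 0.05)

# Probability of exactly 5 journey days out of 6 under each candidate p.
prob.5.of.6 <- dbinom(5, size = 6, prob = p.grid)

# "Unlikely" = less than a 10% chance of the observed outcome.
rejected <- prob.5.of.6 < 0.10

result <- data.frame(p = p.grid, prob = round(prob.5.of.6, 3), rejected = rejected)
print(result)
```

On this grid, every p value up to and including 0.50 is flagged as rejected, and 0.55 is the first one that is not.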
What about trips not taken? In our data set, there are lots of non-journey days (i.e. days of the week with no journeys). For any place pair on any day with no journeys (0 out of 6, column “0T”), we can see that p values of 0.35 or more can be rejected at the 90% level, since at p = 0.35 the chance of seeing no journey days in six is only about 7.5%. So, in our case “unusual” means a p value of less than 0.35.
One takeaway from this analysis is that a sample of 6 weeks does not give us a lot of information to make really significant statements on the data. Even with no observed travel between 2 places on a given day of the week, we can only be confident that the probability of travel between them is less than 35%. If we had 10 weeks of information (see below for a table I worked out with 10 weeks of journey info), then after seeing no trips over 10 weeks we could say the trip probability is less than 0.25 at a 90% confidence level. Similarly, if there was a journey 9 out of 10 weeks we could say the trip probability is greater than 0.8 at a 90% confidence level.
A bit unsatisfying, but having some data is better than having no data, and being able to quantify our intuition about “usual” into a probabilistic statement is something.
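For the zero-journey case the cut-off can also be worked out directly, since the chance of no journey days in n weeks is (1 - p)^n. The sketch below solves (1 - p)^n = 0.10 for p; the function name zero.trip.bound and the extra sample sizes (20 and 52 weeks) are my own additions for illustration.

```r
# Smallest p that can be rejected after n.weeks with no journeys:
# (1 - p)^n.weeks < alpha  <=>  p > 1 - alpha^(1 / n.weeks)
zero.trip.bound <- function(n.weeks, alpha = 0.10) {
  1 - alpha^(1 / n.weeks)
}

weeks  <- c(6, 10, 20, 52)
bounds <- sapply(weeks, zero.trip.bound)
print(data.frame(weeks = weeks, p.upper.bound = round(bounds, 3)))
```

For 6 weeks the bound is about 0.32, which falls between the 0.30 and 0.35 table values, and for 10 weeks it is about 0.21, between 0.20 and 0.25. That is consistent with the “less than 0.35” and “less than 0.25” statements above, and it shows how slowly the bound tightens as weeks are added.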
In terms of scripting, R provides some simple ways to extract the relevant car information from our journey matrices once they have been built, arguably simpler than working from the original data frame.
To find all the place pairs and days of the week that have 4, 5 or 6 journey days, we can use the syntax:
which(pred.journey.x > 3, arr.ind = TRUE)
The result is shown below, where “dim1” is the row (the “from” place), “dim2” is the column (the “to” place) and “dim3” is the day of the week from the original matrix.
For example, we can see that on Mondays (“dim3” = 1) the car had more than 3 journey days out of 6 between places 3 to 3, 10 to 3 and 3 to 10. Similarly, on Tuesdays there were 4 or more journey days between 3 to 6 and 6 to 7.
You can play with the conditional statement in the “which” command to select exact journey values or other conditional queries you can imagine.
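Here is a small, self-contained sketch of a few such variations. The array is a toy one that I filled by hand to resemble the cases discussed above; it is not the real data.

```r
# Toy journey-days array: from X to X day of week (1 = Monday).
pred.journey.x <- array(0, dim = c(10, 10, 7))
pred.journey.x[3, 3, 1]  <- 4
pred.journey.x[10, 3, 1] <- 4
pred.journey.x[3, 10, 1] <- 4
pred.journey.x[3, 6, 2]  <- 5
pred.journey.x[6, 7, 2]  <- 4
pred.journey.x[1, 2, 5]  <- 2

# 4 or more journey days.
frequent <- which(pred.journey.x > 3, arr.ind = TRUE)

# Exactly 5 journey days.
exactly.five <- which(pred.journey.x == 5, arr.ind = TRUE)

# Restrict the "frequent" result to Mondays; drop = FALSE keeps it a matrix.
monday.only <- frequent[frequent[, "dim3"] == 1, , drop = FALSE]
```

Because the array has three dimensions and no dimnames, which() labels the result columns “dim1”, “dim2” and “dim3”, so the output can be filtered by column name as in the last line.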
You can also dig a little deeper into the journey days for specific locations or days. Here is a command to show all the journey days per day of the week that occurred between location 3 and 6.
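One way to do this kind of lookup, sketched on a toy array rather than the real data, is to fix the first two indices and leave the day index empty. The day labels and the values are assumptions for the example.

```r
# Toy journey-days array with labelled days of the week (labels are assumed).
day.names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
pred.journey.x <- array(0, dim = c(10, 10, 7),
                        dimnames = list(NULL, NULL, day.names))
pred.journey.x[3, 6, "Tue"] <- 5
pred.journey.x[3, 6, "Fri"] <- 2

# Journey days from place 3 to place 6, one value per day of the week.
by.day <- pred.journey.x[3, 6, ]
print(by.day)
```

The result is a named vector of length 7, so a single day can be read off with by.day["Tue"].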
Are there places that the car goes to frequently?
On examining the data, I could see that for most day and journey pairs there are 0 or 1 journeys. However, there are a few places and days where journeys are more frequent. Below is a report (code at bottom to generate it) that shows, for a specific car, the journey locations and days (“dim1”, “dim2”, “dim3”) that had more than one journey per day, with the journey count (“journey.count”), the days traveled (“journey.days”) and the ratio of journeys to days (“ratio.v”).
What you can observe is that on Mondays (“dim3” = 1) there are “some regular trips” (with a trip probability of between 0.20 and 0.85, based on the fact that there was travel in 4 out of 6 weeks) from 3 to 3 (ratio 3.25) and 10 to 3 (ratio 1.25).
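A report along these lines can be sketched as follows. This is not the original script: the two arrays are toy versions filled by hand, and selecting cells where the journey count exceeds the journey days is my assumption about what “more journeys per day” means.

```r
# Toy arrays: total journeys and journey days, from X to X day of week.
journey.count.x <- array(0, dim = c(10, 10, 7))
pred.journey.x  <- array(0, dim = c(10, 10, 7))
journey.count.x[3, 3, 1]  <- 13; pred.journey.x[3, 3, 1]  <- 4
journey.count.x[10, 3, 1] <- 5;  pred.journey.x[10, 3, 1] <- 4
journey.count.x[3, 6, 2]  <- 5;  pred.journey.x[3, 6, 2]  <- 5

# Cells with more journeys than journey days, i.e. several trips on some days.
multi <- which(journey.count.x > pred.journey.x, arr.ind = TRUE)

# The index matrix can be reused to pull the matching values from both arrays.
report <- data.frame(
  multi,
  journey.count = journey.count.x[multi],
  journey.days  = pred.journey.x[multi]
)
report$ratio.v <- report$journey.count / report$journey.days
print(report)
```

With these toy values the report has two Monday rows, 3 to 3 with a ratio of 3.25 and 10 to 3 with a ratio of 1.25, while the 3 to 6 Tuesday journey is left out because it never happened more than once a day.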
While this has been an interesting exercise, the less than satisfying part for me is the lack of profound (even relatively profound) statements that can be made on the data, given the relatively small amount of information available. However, in the land of the blind the one-eyed man is king, so some information is better than none, and the ability to put some precision to our intuition of patterns is, I think, valuable.
If I had more data, I would have liked to expand the analysis to cover time as well, either hourly or by time of day (e.g. morning, afternoon, evening). The structure of the analysis matrices would be the same, except I would have to add another dimension for time or time of day (thus making a 4-dimensional matrix). Based on the hypothesis testing I have done with the days, I would think that much more than 10 weeks of data would be needed to start considering this level of detail.
Happy location playing!
Creating the location and frequency matrices
Script for multilocation report