In an earlier post I graphed Citi Bike data and daily temperature.
That graph prompted me to take a closer look at the Citi Bike data to try to answer two questions:
- Does temperature really have a statistically significant impact on Citi Bike use?
- What other factors influence Citi Bike use?
For this exercise I’ve expanded the date range to run from the launch on Memorial Day, May 27th, through December 22nd. I’ve also had the pleasure of working through the data with Alan Salzberg, an expert statistician.
Let’s start by reviewing some background information:
- Annual memberships rapidly climbed, then leveled off around the start of September.
- The daily trips and miles also rose rapidly, then dipped around September.
There were 321,307 daily passes and 30,453 weekly passes sold in the time period. If you assume 1.5 trips for each daily pass and 5 trips for each weekly pass, they account for 10.5% of total trips, with the other 89.5% taken by annual members.
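As a rough check on that arithmetic, here is a minimal Python sketch. The trips-per-pass figures are the assumptions stated above, and the implied total is back-calculated from the 10.5% share rather than read from the trip data, so treat it as an approximation.

```python
# Pass counts reported above for May 27th through December 22nd.
daily_passes = 321_307
weekly_passes = 30_453

# Assumed trips per pass (rough guesses, not measured values).
trips_per_daily_pass = 1.5
trips_per_weekly_pass = 5

pass_trips = (daily_passes * trips_per_daily_pass
              + weekly_passes * trips_per_weekly_pass)

# Back out the total number of trips implied by the 10.5% share.
pass_share = 0.105
implied_total_trips = pass_trips / pass_share

print(f"Estimated trips on daily and weekly passes: {pass_trips:,.1f}")
print(f"Implied total trips (approximate): {implied_total_trips:,.0f}")
```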
What is usage?
What is the best measure of Citi Bike usage for this analysis? After considering daily trips and daily trips per annual member, I’m going to use daily miles traveled. That single number is a good representation of how much the system was used by annual and temporary members.
What affects usage?
As with any human behavior, it’s very difficult to fully explain and model a multitude of variables acting all at once. Citi Bike usage was no doubt affected by things that are difficult to quantify, including press coverage and individual political views on, or acceptance of, bike share. We are also working with a fairly small data set of 210 days.
For this analysis, we looked at more weather data (humidity, precipitation, cloud cover, pressure, visibility, and wind), whether it was a federal holiday, and the day of the week. Let’s see if we can determine which of those variables have a significant impact on Citi Bike usage, and then build a model to test and make predictions.
Does temperature really have a statistically significant impact on Citi Bike use?
Here is a box plot of daily miles traveled by daily mean temperature. The bottom edge of each blue box marks the day at the 25th percentile of miles traveled for that temperature range, and the top edge marks the day at the 75th percentile. The line in the middle of the box is the median. The lines extending from each box reach the remaining non-outlying days, and the points drawn above and below are outlier days.
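To make the pieces of the box plot concrete, here is a minimal Python sketch of how the quartiles, whiskers, and outliers can be computed for one temperature range. The mileage values are made up for illustration, and the 1.5 x IQR rule for flagging outliers is an assumption: it is a common default, but the rule behind the plot above isn't stated.

```python
import statistics

# Hypothetical daily miles for the days in one temperature range (not real data).
miles = [12000, 31000, 33500, 34000, 36000, 37500, 38000, 39500, 41000]

# Quartiles: the bottom edge, middle line, and top edge of the box.
q1, median, q3 = statistics.quantiles(miles, n=4)
iqr = q3 - q1

# Assumed outlier rule: more than 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [m for m in miles if m < lower_fence or m > upper_fence]
inside = [m for m in miles if lower_fence <= m <= upper_fence]

# The whiskers reach the most extreme days that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print("box (25th, median, 75th):", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outlier days:", outliers)
```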
It turns out the correlation coefficient between daily miles and mean temperature is 0.65, which indicates a moderately strong positive correlation. However, note that once it gets too hot (above 80 degrees), the number of miles declines, so a simple linear correlation does not capture the whole relationship. Also note the wide spread in the 60-70 degree range. These might be days where other factors like rain or wind have a large or highly variable effect, resulting in an occasional ideal day for biking (the top point) or some horrible days (the lower points).
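For reference, the coefficient quoted here is presumably Pearson's r (an assumption, since the method isn't named). The minimal Python sketch below uses made-up numbers, not the Citi Bike data, to show how it is computed and why a rise-then-dip pattern pulls a linear correlation down.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (mean temperature in F, daily miles) pairs.
# Miles rise with temperature, then dip on the hottest day.
temps = [40, 50, 60, 70, 80, 90]
miles = [15000, 25000, 35000, 45000, 50000, 42000]

r_all = pearson_r(temps, miles)
r_up_to_80 = pearson_r(temps[:5], miles[:5])  # drop the 90-degree day

print(f"r over all days:       {r_all:.2f}")
print(f"r up to 80 degrees:    {r_up_to_80:.2f}")
```

In this toy example the correlation is lower once the hot day is included, even though temperature still clearly matters; the relationship has just stopped being a straight line.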
What other variables are correlated with the daily miles traveled?
One fun way to visualize the correlations between the numeric variables is a correlogram, which graphically indicates the positive or negative correlation between any two variables where they intersect.
It is difficult to see the variable names, but if you click the image you'll get a larger version. Based on this correlogram, and looking closely at the data, we’ll take a closer look at the day of the week, max temp, min dew point, max wind speed, precipitation, and cloud cover.
We’ve already looked at mean temperature, and the max temperature has the same relationship with a slightly higher correlation (0.67).
The day of the week has some effect on miles traveled, but the substantial overlap between the boxes in this box plot suggests that it isn’t the most important variable.
The daily max wind speed has a moderate negative correlation (-0.44) with miles traveled. Once the daily max wind speed gets to about 17 miles per hour, the miles traveled decline noticeably.
Similarly, cloud cover has a moderate negative correlation (-0.41) with miles traveled. On a scale of 0 (no clouds) to 8 (very cloudy), more clouds seem to go with fewer rides (of course, clouds may be closely linked to precipitation and wind).
Precipitation also has a negative correlation, though it is weaker (-0.36) than that of the other variables. The graph shows that there weren’t many rainy days, but on the two days with rainfall greater than 2 inches there weren’t many rides.
A linear model
Next, let's create a linear model that incorporates each of the variables we’ve identified as significant so we can consider not just the individual relationship to miles traveled, but the combined effect of multiple explanatory variables.
A linear model based on day of week, max temp, precipitation, cloud cover, min dew point, and max wind delivers an R-squared of 63%, meaning it explains about 63% of the variation in daily miles, which is a decent fit. This graph plots the miles predicted by the model against the actual miles traveled.
Now that we have a model, we can predict future usage. Let’s make an estimate for May 27th, 2014, the one-year anniversary of the Citi Bike launch. We’ll use the average weather on the past 10 May 27ths (2004-2013). That gives us a high of 79 degrees, precipitation of 0.12 inches, cloud cover of 3.2 (out of 8), min dew point of 50.4F, and max wind of 12.7 mph. Finally, next May 27th falls on a Tuesday.
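For readers unfamiliar with R-squared, here is a minimal Python sketch of how it can be computed from actual and predicted values. The numbers are invented for illustration and are not the model's output; the function name `r_squared` is just a label chosen for this example.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    # Variation left unexplained by the model.
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation around the mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

# Hypothetical daily miles: actual values and a model's predictions.
actual = [30000, 42000, 51000, 38000, 45000]
predicted = [33000, 40000, 47000, 41000, 44000]

fit = r_squared(actual, predicted)
print(f"R-squared: {fit:.2f}")
```

A value of 1 would mean the predictions match every day exactly, while a value of 0 would mean the model does no better than always guessing the average.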
Our linear model formula looks like this (in miles traveled):
- Intercept: 10,226 miles
- Tuesday: -1,161 miles
- Max temp: 1,004 miles for every degree F
- Precipitation: -10,212 miles for every inch of rain
- Cloud cover: -1,178 miles for every unit of cloud cover
- Min dew point: -274 miles for every degree F
- Max wind: -816 miles for every mph of wind
Based on those values, our model predicts 59,149 miles traveled. You heard it here first.
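The arithmetic behind that prediction can be sketched in a few lines of Python. The dictionary keys are labels made up for this example. Note that plugging in the rounded coefficients and inputs listed above gives roughly 59,213 miles, about 0.1% away from the 59,149 reported; the small gap is presumably due to rounding in the published figures (an assumption, since the unrounded values aren't shown).

```python
# Rounded coefficients as listed above (miles per unit of each variable).
intercept = 10_226
coefficients = {
    "tuesday": -1_161,        # applies because May 27th, 2014 is a Tuesday
    "max_temp_f": 1_004,
    "precip_in": -10_212,
    "cloud_cover": -1_178,
    "min_dew_point_f": -274,
    "max_wind_mph": -816,
}

# Average May 27th conditions, 2004-2013, as given above.
inputs = {
    "tuesday": 1,
    "max_temp_f": 79,
    "precip_in": 0.12,
    "cloud_cover": 3.2,
    "min_dew_point_f": 50.4,
    "max_wind_mph": 12.7,
}

# Linear model: intercept plus each coefficient times its input.
predicted_miles = intercept + sum(
    coefficients[name] * inputs[name] for name in coefficients
)

print(f"Predicted miles: {predicted_miles:,.0f}")
```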
If you'd like to play with the data yourself, this interactive scatter plot visualizes some of the variables discussed in this post.