### Looking at (small) data from five years of driving

I finally got rid of my car earlier this year. It served me well, but I currently live in the city and take the subway most of the time. Now that it's gone, it's time to take a look at the data.

From when I bought the car used in 2009 until I sold it this past summer, I made a log entry every time I refueled the car. (I realize this is not normal.)

The very last page... :.(

I noted the date, mileage on the odometer, number of gallons of gas I bought, and the price of gas. Every now and then I'd remember to bring in the log and update my spreadsheet on Google Docs. Here's what the data looks like after being downloaded as a csv.

In [1]:
!head data/jeep.csv

"date","miles","gal","price"
"5/8/2009","35,666",15.9,2.09
"5/8/2009","36,052",17,2.05
"5/9/2009","36,433",17.1,2.19
"5/16/2009","36,774",16.7,2.45
"5/22/2009","37,095",16.1,2.45
"6/7/2009","37,434",17.1,2.59
"6/19/2009","37,774",17.2,2.69
"6/21/2009","38,179",17.6,2.49
"6/23/2009","38,561",17.3,2.64


Let's read it in and get going.

In [2]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

import numpy as np
import pandas as pd

df = pd.read_csv('data/jeep.csv', parse_dates=['date'])

Out[2]:
date miles gal price
0 2009-05-08 35,666 15.9 2.09
1 2009-05-08 36,052 17.0 2.05
2 2009-05-09 36,433 17.1 2.19
3 2009-05-16 36,774 16.7 2.45
4 2009-05-22 37,095 16.1 2.45
In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 170
Data columns (total 4 columns):
date     171 non-null datetime64[ns]
miles    171 non-null object
gal      171 non-null float64
price    171 non-null float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 6.7+ KB


Looks like the miles column was read in as strings (its dtype is object), so we just need to fix that:

In [4]:
# remove commas and cast to int type
df.miles = df.miles.str.replace(',', '').astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 171 entries, 0 to 170
Data columns (total 4 columns):
date     171 non-null datetime64[ns]
miles    171 non-null int64
gal      171 non-null float64
price    171 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 6.7 KB


## Miles driven¶

Let's look at the mileage over time. I bought the car used, so the mileage starts near 40,000.

In [5]:
# use friendly commas for thousands
from matplotlib.ticker import FuncFormatter
commas = FuncFormatter(lambda x, p: format(int(x), ','))

plt.plot(df.date, df.miles)
plt.xlabel('date')
plt.ylabel('mileage')
plt.gca().get_yaxis().set_major_formatter(commas)
plt.show()


## Mileage per gallon¶

One quantity of interest to car buyers is fuel efficiency. Without knowing much about cars, I'd assume that as components age the mileage per gallon (MPG) will decline. Let's see if that's the case. The first thought on how to do this would be

$$\frac{\textrm{miles driven since last fill-up}}{\textrm{gallons purchased last time}}$$

but the fill-ups came at irregular intervals and weren't always up to a full tank, so these fractions could be all over the place. Instead, we can divide the cumulative sum of miles driven by the cumulative amount of gasoline purchased. The first few numbers will be crazy but it will pretty quickly converge to a reasonable estimate.

In [6]:
# add a miles per gallon series
df['mpg'] = df.miles.diff().cumsum() / df.gal.cumsum()

# plot the points
plt.scatter(df.miles, df.mpg, c='#30a2da')

# plot the exponentially weighted moving average of the points
plt.plot(df.miles, pd.ewma(df.mpg, 5), lw=2, alpha=0.5, c='#fc4f30')

# ignore the first handful of these
plt.xlim(45000, df.miles.max())
plt.ylim(17, 21)

plt.xlabel('mileage')
plt.ylabel('mpg')
plt.show()


A couple of interesting notes here:

• The dots that are unusually high probably correspond with long highway drives.
• As for the discontinuity around 54,000 miles, I figured out by looking through my car folder that this was when I replaced the tires. It's possible that the new tires had higher drag causing lower fuel efficiency.

## Fitting a simple model¶

It could be interesting to quantify how fuel efficiency changed as the car aged.

In [7]:
import statsmodels.formula.api as smf

# create a new view into the old dataframe for fitting
# (throw away the first 30 points)
model_df = df.ix[30:, ['miles', 'mpg']].copy()

# divide the new miles by 10k so the regression coefficients aren't tiny
model_df.miles /= 1e4

# fit the model
results = smf.ols('mpg ~ miles', data=model_df).fit()

# print out results
results.summary()

Out[7]:
Dep. Variable: R-squared: mpg 0.943 OLS 0.942 Least Squares 2285. Thu, 25 Dec 2014 3.55e-88 20:26:13 101.49 141 -199.0 139 -193.1 1
coef std err t P>|t| [95.0% Conf. Int.] 22.1292 0.060 368.805 0.000 22.011 22.248 -0.4399 0.009 -47.807 0.000 -0.458 -0.422
 Omnibus: Durbin-Watson: 0.313 0.784 0.855 0.124 0.063 0.94 3.072 40.1

It looks like my fuel efficiency went down about 0.44 miles/gal for every 10,000 miles of wear and tear.

## Miles driven per year compared to US average¶

First, we'll sum up the data grouped by year.

In [8]:
driven_per_year = df.set_index('date').miles.diff().resample('1A', how='sum')
driven_per_year

Out[8]:
date
2009-12-31    11303
2010-12-31    14645
2011-12-31     8935
2012-12-31     7206
2013-12-31     3503
2014-12-31     1487
Freq: A-DEC, Name: miles, dtype: float64

Next, we'll grab annual driving stats from the Federal Highway Administration, an agency of the U.S. Department of Transportation.

In [9]:
# use read_html to get the <table> element as a DataFrame
us_avg_miles = parsed_tables.pop().set_index('Age')
us_avg_miles

Out[9]:
Male Female Total
Age
16-19 8206 6873 7624
20-34 17976 12004 15098
35-54 18858 11464 15291
55-64 15859 7780 11972
65+ 10304 4785 7646
Average 16550 10142 13476
In [10]:
driven_per_year / us_avg_miles.ix['20-34', 'Male']

Out[10]:
date
2009-12-31    0.628783
2010-12-31    0.814697
2011-12-31    0.497052
2012-12-31    0.400868
2013-12-31    0.194871
2014-12-31    0.082721
Freq: A-DEC, Name: miles, dtype: float64

So it looks like I drive less than average. This makes sense — from 2008 until mid-2010, I spent about half the year at sea with my car sitting on the pier. In 2010, I moved back to Boston and my primary mode of transportation was walking or taking the T.

## Gas prices¶

Even without driving that much, it's possible to spend quite a bit on gas.

In [11]:
plt.plot(df.date, (df.gal * df.price).cumsum(), label='cumulative dollars spent on gas')
plt.plot(df.date, df.gal.cumsum(), label='cumulative gallons gasoline bought')

plt.xlabel('date')
plt.gca().get_yaxis().set_major_formatter(commas)

plt.legend()
plt.show()


Thanks to the Energy Information Administration, an entity within the U.S. Department of Energy, it's also possible to compare the prices at which I bought gas to the national average price over time.

In [12]:
!wget -O data/gas.xls http://www.eia.gov/dnav/pet/xls/PET_PRI_GND_DCUS_NUS_W.xls

--2014-12-25 20:26:15--  http://www.eia.gov/dnav/pet/xls/PET_PRI_GND_DCUS_NUS_W.xls
Resolving www.eia.gov (www.eia.gov)... 205.254.135.7, 2607:f368:1000:1001::1007
Connecting to www.eia.gov (www.eia.gov)|205.254.135.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 328192 (320K) [application/vnd.ms-excel]
Saving to: ‘data/gas.xls’

100%[======================================>] 328,192      419KB/s   in 0.8s

2014-12-25 20:26:15 (419 KB/s) - ‘data/gas.xls’ saved [328192/328192]



It's an Excel file, unfortunately, but pandas makes it easy to get at the data. I opened Libreoffice just to see how it's laid out, then read it in with the read_excel method.

In [13]:
gas = pd.read_excel('data/gas.xls', sheetname='Data 1', header=2)
gas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1271 entries, 0 to 1270
Data columns (total 16 columns):
Date                                                                                       1271 non-null datetime64[ns]
Weekly U.S. All Grades All Formulations Retail Gasoline Prices  (Dollars per Gallon)       1134 non-null float64
Weekly U.S. All Grades Conventional Retail Gasoline Prices  (Dollars per Gallon)           1048 non-null float64
Weekly U.S. All Grades Reformulated Retail Gasoline Prices  (Dollars per Gallon)           1048 non-null float64
Weekly U.S. Regular All Formulations Retail Gasoline Prices  (Dollars per Gallon)          1265 non-null float64
Weekly U.S. Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)              1265 non-null float64
Weekly U.S. Regular Reformulated Retail Gasoline Prices  (Dollars per Gallon)              1048 non-null float64
Weekly U.S. Midgrade All Formulations Retail Gasoline Prices  (Dollars per Gallon)         1048 non-null float64
Weekly U.S. Midgrade Conventional Retail Gasoline Prices  (Dollars per Gallon)             1048 non-null float64
Weekly U.S. Midgrade Reformulated Retail Gasoline Prices  (Dollars per Gallon)             1048 non-null float64
Weekly U.S. Premium All Formulations Retail Gasoline Prices  (Dollars per Gallon)          1048 non-null float64
Weekly U.S. Premium Conventional Retail Gasoline Prices  (Dollars per Gallon)              1048 non-null float64
Weekly U.S. Premium Reformulated Retail Gasoline Prices  (Dollars per Gallon)              1048 non-null float64
Weekly U.S. No 2 Diesel Retail Prices  (Dollars per Gallon)                                1084 non-null float64
Weekly U.S. No 2 Diesel Ultra Low Sulfur (0-15 ppm) Retail Prices  (Dollars per Gallon)    412 non-null float64
Weekly U.S. No 2 Diesel Low Sulfur (15-500 ppm) Retail Prices  (Dollars per Gallon)        96 non-null float64
dtypes: datetime64[ns](1), float64(15)
memory usage: 168.8 KB


From here, we'll narrow the national gas data down to just the part we care about (regular gasoline) and plot it alongside what I paid.

In [14]:
# slice off the data we care about
regular = gas[['Date', 'Weekly U.S. Regular All Formulations Retail Gasoline Prices  (Dollars per Gallon)']]
regular.columns = ['date', 'usa_avg_price']

# plot the two trends
plt.plot(df.date, df.price, label='dollars per gallon paid')
plt.plot(regular.date, regular.usa_avg_price, lw=2, label='national avg')

# annotate the figure
plt.xlabel('date')
plt.ylabel('gas price per gallon (USD)')
plt.xlim(df.date.min(), df.date.max())

plt.legend()
plt.show()


I was usually filling up in New England which is probably a little more expensive than the national average.

## In closing¶

Not exactly the most hard hitting blog entry of all time, but I spent a bunch of time over the years writing it all down so... may as well make some plots!

Aside: for anyone considering keeping a log in your car, keep in mind that you will take a lot of flak from passengers. Think about it.

Any comments or suggestions? Let me know.