Histogram of times from a CSV via Pandas

1.1k views Asked by At

I am analysing race results from a CSV which looks like this:

Position,Time,Race #,Batch,Name,Surname,Category,Sex,Age
1,00:25:04,58,E,Luke,Schlebusch,Junior,Male,17
2,00:25:16,92,E,Anrich,Zimmermann,Junior,Male,17
3,00:26:27,147,E,Ryan,Mathaba,Open,Male,33
4,00:26:58,53,E,Daniel,Rademan,Junior,Male,16
5,00:27:17,19,E,Werner,Du Preez,Open,Male,29
6,00:27:44,148,E,Mazu,Ndandani,Open,Male,37
7,00:27:45,42,E,Dakota,Murphy,Open,Male,20
8,00:28:29,56,E,David,Schlebusch,Master,Male,51
9,00:28:32,52,E,Caleb,Rademan,Minimee,Male,12

I am using the following call to read_csv to parse this into a Pandas dataframe:

race1 = pandas.read_csv('data.csv', parse_dates='Time', index_col='Time')

This enables me to plot a cumulative distribution of race times very easily by just doing:

race1.Position.plot()

Pandas handles all the intricacies of the date data type and makes a nice x axis with proper formatting of the times.

Is there an elegant way of getting a histogram of times which is similarly straightforward? Ideally, I would like to be able to do race1.index.hist() or race1.index.to_series().hist(), but I know that doesn't work.

I've been able to coerce the time to a timedelta and get a working result with

times = race1.index.to_series()
((times - times[0]).dt.seconds/60).hist()

This produces a histogram of the correct shape, but obviously with wrong x values (they are off by the fastest time).

Is there an elegant way to read the column as a timedelta to begin with, and is there a better way of creating the histogram, including proper ticks? Proper ticks here mean that they use the correct locator and updates properly.

1

There are 1 answers

1
chthonicdaemon On BEST ANSWER

This appears to work pretty well, although I would be happier with it if it didn't go through the Matplotlib date specifics regarding ordinal dates.

times = race1.index.to_series()
today = pandas.Timestamp('00:00:00')
timedelta = times - today
times_ordinal = timedelta.dt.seconds/(24*60*60) + today.toordinal()
ax = times_ordinal.hist()
ax.xaxis_date()
plt.gcf().autofmt_xdate()
plt.ylabel('Number of finishers')

Sample of output