python and pandas for annual hourly averages over many years for NOAA rainfall data

1.5k views Asked by At

I am new to stack overflow and pandas, but I appreciate this platform and have an interesting question: I have a pandas data frame that takes the NOAA rainfall data (in csv format for the hours that have rainfall, various years yet sequential, some data missing), replaces the NaNs with zeros, and makes a clean hourly data file for our water/plumbing engineers for all of the years available from NOAA (varies completely). However, the engineer would like one 8760 (the number of hours in a non-leap year) hourly data file that is the average of each hour from each of the years available from NOAA.

For example, I have hourly NOAA data from 1:00AM July 1, 1987 to 12:00AM December 31st, 2001; I make a huge hourly df, but now I need to make an annual 8760 hourly df with the average from each hour of the year (the average from January 1st at 1:00am across all years, the average from January 1st at 2:00am across all years, ..., the average from December 31st at 12:00am across all years) bearing in mind the start of the data AND the leap years! Any insight how to do this successfully?

1

There are 1 answers

0
closlas On

Pandas is great for these kind of things. What you need to do is:

  1. Create a column in your df with the month, day,hour of the datetime column
  2. Use the groupby method to create a mapping of grouped rows
  3. Calculate the mean of these groups

Here is a snippet which creates a dummy dataset & calculates the mean of each group:

import pandas as pd
import numpy as np

#creating some dummy data
n_years = 3
n_hours = 3
st_times = ['01-01-198{0} 00:00'.format(i) for i in range(n_years)]
nd_times = ['01-01-198{0} 0{1}:00'.format(i,n_hours-1) for i in range(n_years)]

indx_list = []
for s, e in zip(st_times, nd_times):
    indx = pd.date_range(start=s, end=e, freq='H')
    indx_list.append(indx.values)
index = pd.DatetimeIndex(np.concatenate(indx_list,axis=0))

data = pd.DataFrame({'rainfall': list(range(n_years*n_hours)),
              'rainfall_1': list(reversed(range(n_years*n_hours)))
             }, index=index)

#creating the hour, day, month, & day columns
data.loc[:,'hour'] = data.index.hour.values
data.loc[:,'day'] = data.index.day.values
data.loc[:,'month'] = data.index.month.values

#create groups and calculate the mean of each group
data.groupby(['month','day','hour']).mean()