Resampling pandas DataFrame for every nth row with different parameters on each column?

I've got minute-by-minute stock data with the column headings Date, Time, Open, High, Low, Close, Volume. I need to resample every nth row (in this example, every 3rd row), but with a different aggregation on several of the columns. The code I have so far is:

import pandas as pd
import numpy as np

spy = pd.read_csv("C:\\Users\\PC\\Desktop\\spy_test.csv")

#Sample for every n minutes
n = 3
b = n-1


spy_date = pd.DataFrame(spy['Date'])
date = spy_date.iloc[b::n, :]

spy_time = pd.DataFrame(spy['Time'])
time = spy_time.iloc[b::n, :]
time = time.reset_index(drop=True)
spy_open = pd.DataFrame(spy['Open'])
open = spy_open.iloc[::n, :]
open = open.reset_index(drop=True)

spy_high = pd.DataFrame(spy['High'])
high_s = spy_high.iloc[::n, :].max()
high = pd.DataFrame(high_s)
high = high.reset_index(drop=True)

spy_low = pd.DataFrame(spy['Low'])
low_s = spy_low.iloc[::n, :].min()
low = pd.DataFrame(low_s)
low = low.reset_index(drop=True)

spy_close = pd.DataFrame(spy['Close'])
close = spy_close.iloc[::n, :]
close = close.reset_index(drop=True)

spy_volume = pd.DataFrame(spy['Volume'])
volume_s = spy_volume.iloc[n::3, :].sum()
volume = pd.DataFrame(volume_s)
volume = volume.reset_index(drop=True)

joined = [date, time, open, high, low, close, volume]

result = pd.concat(joined, axis=1)
result.columns = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']

print(result)

This code worked, except that the High, Low, and Volume columns each returned only a single value. I wanted the max/min/sum of each 3-minute block, but it returned the max/min/sum for the entire column.
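The collapse to a single value can be reproduced on a small frame. This is only a sketch with invented numbers: slicing with `iloc[::n]` keeps every nth row, and calling `.max()` on that slice reduces it to one number per column instead of one per block.

```python
import pandas as pd

# Invented prices: six one-minute rows, i.e. two blocks of three.
high = pd.DataFrame({"High": [10.0, 12.0, 11.0, 9.0, 8.0, 13.0]})
n = 3

# Every nth row: only rows 0 and 3 survive the slice.
every_nth = high.iloc[::n, :]

# .max() reduces the whole slice to a single value per column.
collapsed = every_nth.max()

print(every_nth)
print(collapsed)
```

Note that the slice has already discarded the rows in between, so the per-block highs (12.0 and 13.0 here) cannot be recovered from it.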

Also, if you know of an easier way to do this that a beginner would more or less understand, I am all ears. I've only been coding for a couple of weeks, so I really have no clue what I'm doing.

Note: I considered using qcut, but as far as I can tell I would need to work out ahead of time how many bins I want. As I'm going to be running varying sets of data through this same basic format, that seemed less than ideal.

1 answer

Ted Petrou (best answer)

Since your times are in military (24-hour) format, first convert them to timedeltas. Because they are stored as integers, you will likely also need to add leading zeros, which I have done with zfill. Here df stands for your spy DataFrame.

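# Pad to four digits (930 -> "0930"), parse as HHMM, then convert to a timedelta.
df['Time'] = pd.to_timedelta(pd.to_datetime(df.Time.map(lambda x: str(x).zfill(4)), format='%H%M').dt.time.astype(str))
# pd.Grouper(freq=...) replaces the removed pd.TimeGrouper; 'first' gives each block's opening price.
df.set_index('Time').groupby(['Date', pd.Grouper(freq='3min')]).agg({'Open':'first', 'High':'max', 'Low':'min', 'Close':'last','Volume':'sum'})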
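If you want blocks of exactly n rows regardless of the clock, a different technique from the time-based grouping above is to group on the row position divided by n. The following is a minimal sketch, assuming the question's column names; the data is invented and `block` is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

# Invented minute bars; column names follow the question.
spy = pd.DataFrame({
    "Date": ["2017-01-03"] * 6,
    "Time": [930, 931, 932, 933, 934, 935],
    "Open": [10.0, 10.5, 10.2, 10.4, 10.1, 10.3],
    "High": [10.6, 10.7, 10.5, 10.6, 10.4, 10.9],
    "Low": [9.9, 10.1, 10.0, 10.0, 9.8, 10.2],
    "Close": [10.5, 10.2, 10.4, 10.1, 10.3, 10.8],
    "Volume": [100, 200, 300, 400, 500, 600],
})

n = 3
# Integer division of the row position labels rows 0-2 as block 0, rows 3-5 as block 1, and so on.
block = np.arange(len(spy)) // n

# One aggregation per column; Date and Time take the last row of each block.
result = spy.groupby(block).agg({
    "Date": "last", "Time": "last", "Open": "first", "High": "max",
    "Low": "min", "Close": "last", "Volume": "sum",
})
print(result)
```

This needs no bin count up front, since the number of blocks follows from the length of the data. One caveat: the blocks are purely positional, so they ignore gaps in the data and can straddle two dates, which the time-based grouping avoids.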