I know there have been a few posts on this, but my case is a little bit different and I wanted to get some help on this.
I have a pandas dataframe symbol_df with 1 min bars in the below format for each stock symbol:
id Symbol_id Date Open High Low Close Volume
1 1 2023-12-13 09:15:00 4730.95 4744.00 4713.95 4696.40 2300
2 1 2023-12-13 09:16:00 4713.20 4723.70 4717.85 4702.55 1522
3 1 2023-12-13 09:17:00 4716.40 4718.55 4701.00 4701.00 909
4 1 2023-12-13 09:18:00 4700.15 4702.80 4696.70 4696.00 715
5 1 2023-12-13 09:19:00 4696.70 4709.90 4702.00 4696.10 895
... ... ... ... ... ... ... ...
108001 1 2024-03-27 13:44:00 6289.95 6291.95 6289.00 6287.55 989
108002 1 2024-03-27 13:45:00 6288.95 6290.85 6289.00 6287.75 286
108003 1 2024-03-27 13:46:00 6291.25 6293.60 6292.05 6289.10 1433
108004 1 2024-03-27 13:47:00 6295.00 6299.00 6293.20 6293.15 2702
108005 1 2024-03-27 13:48:00 6292.05 6296.55 6291.95 6291.95 983
I would like to calculate the "Relative Volume Ratio" indicator and add this calculated value to the symbol_df as a new column on a rolling basis.
"Relative volume ratio" indicator calculated as below:
So far today's Volume is compared with the mean volume of the last 10 days of the same period. To get the ratio value, we simply divide "today so far volume" by "mean volume of the last 10 days of the same period".
For example..the current bar time is now 13:48.
cumulativeVolumeOfToday = Volume of 1 minuite bars between 00:00 -13:48 today added up
avergeVolumeOfPreviousDaysOfSamePeriod = Average accumulation of volume from the same period(00:00 - 13:48) over the last 10 days.
relativeVolumeRatio = CumulativeVolumeOfToday/AvergeVolumeOfPrevious10DaysOfSamePeriod
Add this value as a new column to the dataframe.
Sample data download for the test case:
import yfinance as yf #pip install yfinance
from datetime import datetime
import pandas as pd
symbol_df = yf.download(tickers="AAPL", period="7d", interval="1m")["Volume"]
symbol_df=symbol_df.reset_index(inplace=False)
#symbol_df['Datetime'] = symbol_df['Datetime'].dt.strftime('%Y-%m-%d %H:%M')
symbol_df = symbol_df.rename(columns={'Datetime': 'Date'})
#We can only download 7 days sample data. So 5 days mean for calculations
How can I do this in Pandas?
TL;DR
Explanation
Based on the provided description, you need to perform several transformations on the aggregated data. First is to cumulatively summarize the data for each day. Then run a [ten]-day window over the data grouped by time of day to calculate the average. And at the end, actually divide the former by the latter.
Let's say, you have the following test data, where
"Date"is a column of typedatetime:To calculate the Relative Volume Ratio values, we will use
"Volume"as a separate sequence with date-time stamps"Date"as its index:Let's create a sequence of cumulative volumes for each day. For this, we group
volumeby its date (the year, month and day values with no time) and applycumsumto a group (usesort=Falsein hopes to speed up calculations):To calculate the mean of cumulative volumes at the same time of day in the given number of previous days, we group
cum_volumeby its time (hours and minutes with no year, month, day values), and apply rolling calculations to each group to obtain averages over windows. Note that here we need the source data to be sorted by date-time stamps since only business days are taken into account and we can't use a non-fixed frequency of"10B"as awindowvalue. To calculate means for exactly the previous days excluding the current one, we passclosed='left'(see DataFrameGroupBy.rolling docs for details):Now the final touch with the window of 5 days:
Comparison
Compared to Andrei Kesely's solution, this one wins in speed (on Intel Core i3-2100, for example, processing the data offered there will take over 1 minute versus 300-400 ms with the code above). The calculation result is the same for timestamps after the first 10 days. But in the beginning, when there's less then 10 previous days, calculation of mean in rolling windows is made as if there's always 10 items (missing values are set to nan). Whereas in the case of the Kesely's solution, we obtain average values only for the available cumulative volumes.