Optimization of filter & loops to run on 1M rows


I am trying to run a query with multiple filters on a DataFrame.

It works like a charm on my small sample (below), but it takes a long time as the data grows.

import pandas as pd

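df = pd.DataFrame({'ID': ['FACL01', 'FACL02', 'FACL03', 'FACL01', 'FACL04', 'FACL06', 'FACL07',
                          'FACL08', 'FACL09', 'FACL01', 'FACL11', 'FACL12'],
                   'AMOUNT': [10, 20, 30, 40, 50, 60, 70, 80, 20, 10, 30, 10],
                   'DATE': [20201503, 20201503, 20201503, 20201502, 20201503, 20201502,
                            20201501, 20201503, 20201503, 20201501, 20201503, 20201502]})
df['AVG_AMOUNT'] = 0.0  # float column, since the means are floats

# M1 (the DATE values in the 3-month window) is not defined in the post; assumed here:
M1 = [20201501, 20201502, 20201503]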

%%time
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for idx, x in df['ID'].items():
    df.loc[idx, 'AVG_AMOUNT'] = df[(df['DATE'].isin(M1)) & (df['ID'] == x)]['AMOUNT'].mean()

I am trying to get the average of all AMOUNT values within a 3-month period (M1) for a particular ID, to fill in AVG_AMOUNT.

Answer by Juan C (best answer):

I modified your data a bit, because you provided almost as many distinct IDs as rows, which would make rolling means futile. I reduced it to 2 IDs, but the rest is the same:

df = pd.DataFrame({'ID': ['FACL01', 'FACL01', 'FACL01', 'FACL01', 'FACL04',
                          'FACL04', 'FACL04', 'FACL04', 'FACL04', 'FACL04',
                          'FACL04', 'FACL04'],
                   'AMOUNT': [10, 20, 30, 40, 50, 60, 70, 80, 20, 10, 30, 10],
                   'DATE': [20201503, 20201503, 20201503, 20201502, 20201503, 20201502,
                            20201501, 20201503, 20201503, 20201501, 20201503, 20201502]})

df = df.sort_values(['ID', 'DATE'])  # Sort for clarity
# Sum AMOUNT per (ID, DATE), then average those sums over a rolling window of up to 3 rows
dfgroup = df.groupby(['ID', 'DATE']).AMOUNT.sum().rolling(3, min_periods=1).mean()

Output:

ID      DATE    
FACL01  20201502     40.0
        20201503     50.0
FACL04  20201501     60.0
        20201502     70.0
        20201503    110.0
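
One caveat about the output above: rolling is applied to the whole grouped Series, so the 3-row window runs across ID boundaries. The 60.0 for FACL04 / 20201501 is (40 + 60 + 80) / 3, which includes the two FACL01 sums. If the window should restart for each ID, a minimal sketch on the same sample data could look like this (the names monthly and per_id are illustrative, not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['FACL01', 'FACL01', 'FACL01', 'FACL01', 'FACL04', 'FACL04',
           'FACL04', 'FACL04', 'FACL04', 'FACL04', 'FACL04', 'FACL04'],
    'AMOUNT': [10, 20, 30, 40, 50, 60, 70, 80, 20, 10, 30, 10],
    'DATE': [20201503, 20201503, 20201503, 20201502, 20201503, 20201502,
             20201501, 20201503, 20201503, 20201501, 20201503, 20201502],
})

# Sum AMOUNT per (ID, DATE), as in the answer.
monthly = df.groupby(['ID', 'DATE'])['AMOUNT'].sum()

# Apply the rolling mean inside each ID, so a window never mixes two IDs.
# transform keeps the original (ID, DATE) index.
per_id = monthly.groupby(level='ID').transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)
print(per_id)
```

With the sample data, FACL04 / 20201501 comes out as 80.0 instead of 60.0. The window still counts rows (dates that are present), not calendar months. The steps below continue with dfgroup as computed above.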

If you want to add this result (dfgroup) to your DataFrame, you could do something like:

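# Name the Series so it becomes a column after reset_index()
dfgroup.name = 'Average_Amount'
# merge joins on the columns both frames share: ID and DATE
df = df.merge(dfgroup.reset_index())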

Output 2:

        ID  AMOUNT      DATE  Average_Amount
0   FACL01      40  20201502            40.0
1   FACL01      10  20201503            50.0
2   FACL01      20  20201503            50.0
3   FACL01      30  20201503            50.0
4   FACL04      70  20201501            60.0
5   FACL04      10  20201501            60.0
6   FACL04      60  20201502            70.0
7   FACL04      10  20201502            70.0
8   FACL04      50  20201503           110.0
9   FACL04      80  20201503           110.0
10  FACL04      20  20201503           110.0
11  FACL04      30  20201503           110.0
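
If the goal is exactly what the loop in the question computes (one mean of AMOUNT per ID over the rows whose DATE is in M1, written to every row of that ID), a groupby can replace the row-by-row filtering. This is a rough sketch, not a benchmarked solution. It uses the question's sample data with the IDs quoted, and it assumes M1 is a list of DATE values; M1 is not defined in the question, so the value below is a placeholder, and avg_by_id is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['FACL01', 'FACL02', 'FACL03', 'FACL01', 'FACL04', 'FACL06', 'FACL07',
           'FACL08', 'FACL09', 'FACL01', 'FACL11', 'FACL12'],
    'AMOUNT': [10, 20, 30, 40, 50, 60, 70, 80, 20, 10, 30, 10],
    'DATE': [20201503, 20201503, 20201503, 20201502, 20201503, 20201502,
             20201501, 20201503, 20201503, 20201501, 20201503, 20201502],
})

# Placeholder: M1 is not defined in the question; assumed to be a list of DATE values.
M1 = [20201501, 20201502, 20201503]

# Filter once, then compute the mean AMOUNT per ID in a single pass.
avg_by_id = df.loc[df['DATE'].isin(M1)].groupby('ID')['AMOUNT'].mean()

# Broadcast the per-ID mean back to every row of that ID.
# IDs with no rows inside M1 get NaN, like the mean of an empty selection in the loop.
df['AVG_AMOUNT'] = df['ID'].map(avg_by_id)
print(df)
```

The loop rebuilds a boolean mask over the full frame for every row, so its cost grows roughly with the square of the row count; the version above filters and groups once, which is likely where most of the time on 1M rows would be saved.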