I'm looking for a more efficient way than pd.concat to combine two pandas DataFrames.

I have a large DataFrame (~7 GB) with the columns "A", "B", "C", and "D". I want to group the frame by "A", then for each group: group by "B", average "C" and sum "D", and then combine all the results into one DataFrame (there's a toy example of the desired output after the list below). I've tried the following approaches -

1) Creating an empty final DataFrame, iterating over the groupby of "A", doing the processing I need, and then pd.concat-ing each group onto the final DataFrame. The problem is that pd.concat is extremely slow.

2) Iterating over the groupby of "A", doing the processing I need, and then appending the result to a CSV file. That works OK, but I want to find out if there is a more efficient way that doesn't involve all the I/O of writing to disk.
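To make it concrete, here is a toy input and the output I'm after (made-up values):

import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": [1, 1, 2, 1],
    "C": [10.0, 20.0, 30.0, 40.0],
    "D": [1, 2, 3, 4],
})

# Desired output - one row per (A, B) pair, "C" averaged, "D" summed:
#    A  B     C  D
#    x  1  15.0  3
#    x  2  30.0  3
#    y  1  40.0  4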

Code examples

First approach - Final DataFrame with pd.concat:

import numpy as np
import pandas as pd

def pivot_frame(in_df_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    res_cols = in_df.columns.tolist()
    res = pd.DataFrame(columns=res_cols)
    g = in_df.groupby(by=["A"])
    for title, group in g:
        # Aggregate each "A" group by "B": average "C", sum "D"
        temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
        temp = temp.reset_index()
        temp.insert(0, "A", title)
        # This re-copies the accumulated frame on every iteration,
        # which is what makes the approach so slow
        res = pd.concat([res, temp], ignore_index=True)
    return res

Second approach - Writing to disk:

import csv
import numpy as np
import pandas as pd

def pivot_frame(in_df_path, output_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    with open(output_path, 'w') as f:
        csv_writer = csv.writer(f, delimiter=DELIMITER)
        csv_writer.writerow(["A", "B", "C", "D"])
        g = in_df.groupby(by=["A"])
        for title, group in g:
            temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
            temp = temp.reset_index()
            temp.insert(0, "A", title)
            # Append each processed group straight to the output file;
            # index=False keeps the row layout matching the 4-column header
            temp.to_csv(f, mode='a', header=False, index=False, sep=DELIMITER)

The second approach works way faster than the first one, but I'm looking for something that would spare me the constant disk access. I read about split-apply-combine (e.g. https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html), but I haven't found it helpful.

Thanks a lot! :)

1 Answer


Solved

So Niels Henkens' comment really helped, and the solution is to just -

result = in_df.groupby(by=["A","B"]).agg({"C": np.mean, "D": np.sum})
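This is exactly the split-apply-combine pattern from the docs: grouping by both keys at once lets pandas do the whole aggregation in a single internal pass, whereas the first approach's pd.concat re-copies the accumulated frame on every iteration, so its cost grows quadratically with the number of groups. To get "A" and "B" back as regular columns, add reset_index:

result = in_df.groupby(by=["A", "B"]).agg({"C": np.mean, "D": np.sum}).reset_index()

(And if a loop is ever unavoidable, collect the pieces in a list and call pd.concat once at the end instead of inside the loop.)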

Another improvement in performance is to use Dask -

import dask.dataframe as dd
import numpy as np

df = dd.read_csv(PATH_TO_FILE, delimiter=DELIMITER)
g = df.groupby(by=["A", "B"]).agg({"C": np.mean, "D": np.sum}).compute().reset_index()
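If memory is still tight, dd.read_csv also accepts a blocksize argument controlling how big each partition is, so only a few chunks of the 7 GB file need to be in memory at any one time. A sketch, where "64MB" is just an illustrative value to tune:

import dask.dataframe as dd

# Each ~64 MB chunk of the CSV becomes its own partition
df = dd.read_csv(PATH_TO_FILE, delimiter=DELIMITER, blocksize="64MB")
g = df.groupby(by=["A", "B"]).agg({"C": "mean", "D": "sum"}).compute().reset_index()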