dataframe.js - Is it possible to sum multiple columns in a grouped dataframe?


This question is specific to dataframe.js.

Here is the test data I am using:

let data = [
    {
        year : 2020,
        v : 0.1,
        cnt_1 : 1,
        cnt_2 : 20
    },
    {
        year : 2020,
        v : 0.1,
        cnt_1 : 3,
        cnt_2 : 20
    },
    {
        year : 2020,
        v : 0.1,
        cnt_1 : 5,
        cnt_2 : 4
    },
    {
        year : 2020,
        v : 0.1,
        cnt_1 : 7,
        cnt_2 : 20
    },
    {
        year : 2020,
        v : 0.2,
        cnt_1 : 9,
        cnt_2 : 20
    },
    {
        year : 2020,
        v : 0.2,
        cnt_1 : 11,
        cnt_2 : 20
    },
    {
        year : 2021,
        v : 0.2,
        cnt_1 : 13,
        cnt_2 : 20
    },
    {
        year : 2020,
        v : 0.1,
        cnt_1 : 15,
        cnt_2 : 20
    },
    {
        year : 2021,
        v : 0.1,
        cnt_1 : 17,
        cnt_2 : 20
    }
];

The result I expect looks like this:

| year      | v         | cnt_1_sum | cnt_2_sum |
------------------------------------------------
| 2020      | 0.1       | 31        | 84        |
| 2020      | 0.2       | 20        | 40        |
| 2021      | 0.2       | 13        | 20        |
| 2021      | 0.1       | 17        | 20        |

I can do this for a single column, as shown below, but I have no idea how to do it for multiple columns (in this case, cnt_1 and cnt_2).

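const DataFrame = require('dataframe-js').DataFrame;

let df = new DataFrame(data);
// group by the two key columns
let grouped = df.groupBy('year', 'v');
// sum cnt_1 within each group; aggregate() names its result column 'aggregation'
let cnt1_sum = grouped.aggregate(grpObj => grpObj.stat.sum('cnt_1')).rename('aggregation', 'cnt_1_sum');
cnt1_sum.show();
// prints: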
| year      | v         | cnt_1_sum |
------------------------------------
| 2020      | 0.1       | 31        |
| 2020      | 0.2       | 20        |
| 2021      | 0.2       | 13        |
| 2021      | 0.1       | 17        |

The only way I know is to join two dataframes on year and v. But that is very inefficient when there are many columns to aggregate (with 8 columns, would I have to join 8 dataframes?).

So here is the question. Is there any way to

  • apply a stat function to multiple columns?
  • add a column from existing data? (the withColumn API does not work with a plain array)

There are 2 answers

0
Adapptative Team On

I was able to do it by slightly changing Igor's code:

// __groups__ is the internal symbol under which dataframe-js stores the groups
const __groups__ = require('../node_modules/dataframe-js/lib/symbol').__groups__;
const groupedDf = sourceDf.groupBy('key');
// build one row per group: the group key columns plus one sum per column
const complexAggregateDf = new DataFrame(Object.values(groupedDf[__groups__]).map(({groupKey, group}) => ({
  ...groupKey,
  'sum1': group.stat.sum('col1'),
  'sum2': group.stat.sum('col2'),
})), [...groupedDf.on, 'sum1', 'sum2']);
0
Igor On

It is possible to achieve this by writing something analogous to how the aggregate() function works. Here is some code to get you started:

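// assuming that sourceDf has columns ['key', 'col1', 'col2']
const groupedDf = sourceDf.groupBy('key');
// assumption: toCollection() returns the groups as {groupKey, group} objects
const groups = groupedDf.toCollection();
// build one row per group: the group key columns plus one sum per column
const complexAggregateDf = new DataFrame([...groups].map(({groupKey, group}) => ({
  ...groupKey,
  'sum1': group.stat.sum('col1'),
  'sum2': group.stat.sum('col2'),
})), [...groupedDf.on, 'sum1', 'sum2']);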
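If touching the library's grouping internals is undesirable, the same "one row per group, several sums per row" idea can be sketched in plain JavaScript. This is only an illustration: `groupAndSum` and the `<column>_sum` naming are hypothetical and not part of dataframe.js, and the sample rows are a shortened variant of the question's data.

```javascript
// Hypothetical helper (not part of dataframe.js): group rows by the key
// columns and sum every column listed in sumColumns in a single pass.
function groupAndSum(rows, keyColumns, sumColumns) {
  const groups = new Map(); // a Map keeps groups in first-seen order
  for (const row of rows) {
    // Serialize the key values so that e.g. (2020, 0.1) identifies one group.
    const key = JSON.stringify(keyColumns.map((c) => row[c]));
    let acc = groups.get(key);
    if (!acc) {
      // First row of this group: copy the key columns, start sums at 0.
      acc = {};
      for (const c of keyColumns) acc[c] = row[c];
      for (const c of sumColumns) acc[`${c}_sum`] = 0;
      groups.set(key, acc);
    }
    for (const c of sumColumns) acc[`${c}_sum`] += row[c];
  }
  return [...groups.values()];
}

// Shortened sample in the shape of the question's data.
const sample = [
  { year: 2020, v: 0.1, cnt_1: 1, cnt_2: 20 },
  { year: 2020, v: 0.2, cnt_1: 9, cnt_2: 20 },
  { year: 2020, v: 0.1, cnt_1: 3, cnt_2: 4 },
  { year: 2021, v: 0.2, cnt_1: 13, cnt_2: 20 },
];

const summed = groupAndSum(sample, ['year', 'v'], ['cnt_1', 'cnt_2']);
console.log(summed);
```

The resulting array of plain objects could then be passed to `new DataFrame(rows, columns)`, the same constructor form used in the answers above, so adding an eighth summed column means adding one name to the list rather than another join.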