How to concat on axis=1 with Dask delayed? (simplified)

675 views Asked by At

Pandas and Dask produce different results (because I'm doing something wrong in Dask I think). I want to get the Dask result to match the Pandas one here.

This toy program should run as-is to demonstrate:

import dask
import dask.dataframe as ddf
import pandas as pd


# This creates a toy pd.DataFrame
def get(ii):
    x = 2 * ii
    return pd.DataFrame.from_dict({'a':[x + 1, x + 2]})


if __name__ == '__main__':

    print('Using Pandas')
    df1 = get(0)
    df2 = get(1)
    pandas_df = pd.concat([df1, df2], axis=1)
    print(pandas_df)

    print('\n\nUsing Dask')
    output = []
    for ii in range(2):
        output.append(dask.delayed(get)(ii))
    temp = ddf.from_delayed(output)
    temp2 = ddf.concat([temp], axis=1)
    dask_df = temp2.compute()
    print(dask_df)

And the output:

Using Pandas (this is what I want)
   a  a
0  1  3
1  2  4


Using Dask (oops what happened here?)
   a
0  1
1  2
0  3
1  4
1

There are 1 answers

0
KRKirov On BEST ANSWER
import dask.dataframe as ddf
from dask import delayed
import pandas as pd

def get(ii):
    x = 2 * ii
    return pd.DataFrame.from_dict({'a':[x + 1, x + 2]})

@delayed()
def make_daskdf(*num):
    df_list = []
    for i in num:
        df_list.append(get(i))
    df = pd.concat(df_list, axis=1)
    return df

dask_df = make_daskdf(0, 1, 2, 3).compute()

Output:

dask_df
Out[38]: 
   a  a  a  a
0  1  3  5  7
1  2  4  6  8