Pandas and Dask produce different results (because I'm doing something wrong in Dask I think). I want to get the Dask result to match the Pandas one here.
This toy program should run as-is to demonstrate:
import dask
import dask.dataframe as ddf
import pandas as pd
# This creates a toy pd.DataFrame
def get(ii):
x = 2 * ii
return pd.DataFrame.from_dict({'a':[x + 1, x + 2]})
if __name__ == '__main__':
print('Using Pandas')
df1 = get(0)
df2 = get(1)
pandas_df = pd.concat([df1, df2], axis=1)
print(pandas_df)
print('\n\nUsing Dask')
output = []
for ii in range(2):
output.append(dask.delayed(get)(ii))
temp = ddf.from_delayed(output)
temp2 = ddf.concat([temp], axis=1)
dask_df = temp2.compute()
print(dask_df)
And the output:
Using Pandas (this is what I want)
a a
0 1 3
1 2 4
Using Dask (oops what happened here?)
a
0 1
1 2
0 3
1 4
Output: