How to merge complex data frames in pandas/dataframe-js/etc

443 views Asked by At

I would imagine what I'm trying to do is fairly simple in pandas but I just can't get it. Really I want to do this in dataframe-js (or danfojs) but any help in either pandas or dataframe-js will be helpful.

Essentially:

  • there can be an array of any number of dataframes with any number of rows and columns.
  • They should all have at least one matching column uuid.
  • Most of the dataframes will have data for every uuid but some might be missing.
  • The only known column name is uuid, so using "merge on" or similar with any other column name isn't an option.

example dataframes:

    let data1 = [
      [['col A', 'uuid'], ['1238', '12']],
      [['col B', 'uuid'], ['42.4', '12']],
      [['col A', 'uuid'], ['1091', '48']],
      [['col B', 'uuid'], ['35.1', '48']],
      [['col B', 'uuid'], ['44.4', '77']],
    ]

desired output (column order doesn't matter):

[
      ['col A', 'uuid', 'col B'],
      ['1238', '12', '42.4'],
      ['1091', '48', '35.1'],
      [null, '77', '44.4'] // null, undefined, NaN...doesn't matter for the gaps
]

please help :)

2

There are 2 answers

0
André On BEST ANSWER

Ok I combined @onyambu's answer with the merge function, which now accepts dataframes of different sizes

# create an initial empty df
t = pd.DataFrame(columns=['uuid'])

# reduce list of dataframes into one
df = reduce(lambda x,y: x.merge(pd.DataFrame(y[1:], columns=y[0]), how='outer'), data1, t)

# squash rows on `uuid` index with stack/unstack
df = df.set_index('uuid').stack().unstack().reset_index()

# output in original "table" format
df2 = np.r_[df.columns.values[None],df.iloc[:].values].tolist()
print(df2)
3
Onyambu On
df = (pd.DataFrame(map(lambda x: dict(zip(*x)), data1)).set_index('uuid').
      stack().unstack().reset_index())

df2 = np.r_[df.columns.values[None,[1,0,2]],df.iloc[:,[1,0,2]].values].tolist()
print(df2)

[['col A', 'uuid', 'col B'],
 ['1238', '12', '42.4'],
 ['1091', '48', '35.1'],
 [nan, '77', '44.4']]