Concat list of pandas data frame, but ignoring column name

Question

10.7k views Asked by Darren Cook At 19 December 2016 at 15:10

Sub-title: Dumb it down pandas, stop trying to be clever.

I've a list (res) of single-column pandas data frames, each containing the same kind of numeric data, but each with a different column name. The row indices have no meaning. I want to put them into a single, very long, single-column data frame.

When I do pd.concat(res) I get one column per input file (and loads and loads of NaN cells). I've tried various values for the parameters (*), but none that do what I'm after.

Edit: Sample data:

res = [
    pd.DataFrame({'A':[1,2,3]}),
    pd.DataFrame({'B':[9,8,7,6,5,4]}),
    pd.DataFrame({'C':[100,200,300,400]}),
]

I have an ugly-hack solution: giving every data frame the same new column name before concatenating:

newList = []
for r in res:
  r.columns = ["same"]
  newList.append(r)
pd.concat( newList, ignore_index=True )

Surely that is not the best way to do it??
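For comparison, here is a minimal sketch of the same workaround that leaves the original frames untouched. It assumes every frame has exactly one column; the names renamed and combined and the placeholder column name "same" are illustrative only.

```python
import pandas as pd

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]

# rename() returns a new frame, so the frames in res keep their own names.
# "same" is an arbitrary placeholder column name (an assumption).
renamed = [df.rename(columns={df.columns[0]: "same"}) for df in res]

# With identical column names, concat stacks the rows into a single column.
combined = pd.concat(renamed, ignore_index=True)
print(combined.shape)
```

Because the column labels now match, pd.concat has nothing to align on and no NaN padding appears.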

BTW, "pandas: concat data frame with different column name" is similar, but my question is even simpler, as I don't want the index maintained. (I also start with a list of N single-column data frames, not a single N-column data frame.)

*: E.g. axis=0 is the default behaviour. axis=1 gives an error. join="inner" is just silly (I only get the index). ignore_index=True renumbers the index, but I still get lots of columns and lots of NaNs.

UPDATE for empty lists

I was having problems (with all the given solutions) when the data had an empty list, something like:

res = [
    pd.DataFrame({'A':[1,2,3]}),
    pd.DataFrame({'B':[9,8,7,6,5,4]}),
    pd.DataFrame({'C':[]}),
    pd.DataFrame({'D':[100,200,300,400]}),
]

The trick was to force the type, by adding .astype('float64'). E.g.

pd.Series(np.concatenate([df.values.ravel().astype('float64') for df in res]))

or:

pd.concat(res,axis=0).astype('float64').stack().reset_index(drop=True)
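The first variant can be written out as a self-contained sketch. It assumes that float64 is an acceptable type for all of the data; the names flat and combined are illustrative.

```python
import numpy as np
import pandas as pd

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': []}),
    pd.DataFrame({'D': [100, 200, 300, 400]}),
]

# Cast each flattened array to float64 before joining, so the empty
# frame cannot influence the dtype of the combined result.
flat = [df.values.ravel().astype('float64') for df in res]
combined = pd.Series(np.concatenate(flat))
print(combined.dtype, len(combined))
```

Casting per frame, rather than once at the end, means every array passed to np.concatenate already has the same dtype.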

There are 2 answers

jezrael On 19 December 2016 at 15:11

I think you need concat with stack:

print (pd.concat(res, axis=1))
     A  B      C
0  1.0  9  100.0
1  2.0  8  200.0
2  3.0  7  300.0
3  NaN  6  400.0
4  NaN  5    NaN
5  NaN  4    NaN

print (pd.concat(res, axis=1).stack().reset_index(drop=True))
0       1.0
1       9.0
2     100.0
3       2.0
4       8.0
5     200.0
6       3.0
7       7.0
8     300.0
9       6.0
10    400.0
11      5.0
12      4.0
dtype: float64
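Note that the stacked result is interleaved row by row rather than frame by frame. The sketch below compares it with flattening each frame in turn. The variable names are illustrative, and the explicit dropna() is an added precaution, on the assumption that whether stack() drops missing values by default may depend on the pandas version.

```python
import numpy as np
import pandas as pd

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]

# Side-by-side concat, then stack: values come out one row at a time.
# dropna() is a precaution in case stack() keeps the NaN padding.
interleaved = pd.concat(res, axis=1).stack().dropna().reset_index(drop=True)

# Flattening each frame in turn keeps the order of the list instead.
sequential = pd.Series(np.concatenate([df.values.ravel() for df in res]))

print(interleaved.tolist())
print(sequential.tolist())
```

Both hold the same 13 values, so the choice only matters if the order of the rows is meaningful.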

Another solution with numpy.ravel for flattening:

print (pd.Series(pd.concat(res, axis=1).values.ravel()).dropna())
0       1.0
1       9.0
2     100.0
3       2.0
4       8.0
5     200.0
6       3.0
7       7.0
8     300.0
10      6.0
11    400.0
13      5.0
16      4.0
dtype: float64

print (pd.DataFrame(pd.concat(res, axis=1).values.ravel(), columns=['col']).dropna())
      col
0     1.0
1     9.0
2   100.0
3     2.0
4     8.0
5   200.0
6     3.0
7     7.0
8   300.0
10    6.0
11  400.0
13    5.0
16    4.0

Solution with list comprehension:

print (pd.Series(np.concatenate([df.values.ravel() for df in res])))
0       1
1       2
2       3
3       9
4       8
5       7
6       6
7       5
8       4
9     100
10    200
11    300
12    400
dtype: int64

Steven G On 19 December 2016 at 16:56 (Accepted Answer)

I would use a list comprehension such as:

import pandas as pd
res = [
    pd.DataFrame({'A':[1,2,3]}),
    pd.DataFrame({'B':[9,8,7,6,5,4]}),
    pd.DataFrame({'C':[100,200,300,400]}),
]


x = []
[x.extend(df.values.tolist()) for df in res]
pd.DataFrame(x)

Out[49]: 
      0
0     1
1     2
2     3
3     9
4     8
5     7
6     6
7     5
8     4
9   100
10  200
11  300
12  400

I tested speed for you.

%timeit x = []; [x.extend(df.values.tolist()) for df in res]; pd.DataFrame(x)
10000 loops, best of 3: 196 µs per loop
%timeit pd.Series(pd.concat(res, axis=1).values.ravel()).dropna()
1000 loops, best of 3: 920 µs per loop
%timeit pd.concat(res, axis=1).stack().reset_index(drop=True)
1000 loops, best of 3: 902 µs per loop
%timeit pd.DataFrame(pd.concat(res, axis=1).values.ravel(), columns=['col']).dropna()
1000 loops, best of 3: 1.07 ms per loop
%timeit pd.Series(np.concatenate([df.values.ravel() for df in res]))
10000 loops, best of 3: 70.2 µs per loop

It looks like

pd.Series(np.concatenate([df.values.ravel() for df in res]))

is the fastest.
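The %timeit lines need IPython. To repeat the comparison in plain Python, a rough sketch with the standard timeit module might look like this. The function names and the repeat counts are arbitrary choices, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

res = [
    pd.DataFrame({'A': [1, 2, 3]}),
    pd.DataFrame({'B': [9, 8, 7, 6, 5, 4]}),
    pd.DataFrame({'C': [100, 200, 300, 400]}),
]

def via_numpy():
    return pd.Series(np.concatenate([df.values.ravel() for df in res]))

def via_stack():
    return pd.concat(res, axis=1).stack().reset_index(drop=True)

def via_extend():
    x = []
    for df in res:
        x.extend(df.values.tolist())
    return pd.DataFrame(x)

# number=200 and repeat=3 are arbitrary; take the best of the repeats
# and divide by the number of calls to get seconds per call.
timings = {
    f.__name__: min(timeit.repeat(f, number=200, repeat=3)) / 200
    for f in (via_numpy, via_stack, via_extend)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```

With only three tiny frames, fixed call overhead dominates, so it is worth re-running this on data of a realistic size before drawing conclusions.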
