The data in excel sheets is stored as follows:
Area | Product1 | Product2 | Product3
| sales|sales.Value| sales |sales.Value | sales |sales.Value
Location1 | 20 | 20000 | 25 | 10000 | 200 | 100
Location2 | 30 | 30000 | 3 | 12300 | 213 | 10
the product name is a merge of 2 cells of two rows "no of sales" and "sales value" for each of 1000 or so areas for a given month. Similarly there are separate files for each month for the last 5 years. Further, new products have been added and removed in different months. So a different month file might look like:
Area | Product1 | Product4 | Product3
Can the forum suggest the best way to read this data using pandas? I can't use index since the product columns are different each month
Ideally, I would like to convert the initial format above to:
Area | Product1.sales|Product1.sales.Value| Product2.sales |Product2.sales.Value |
Location1 | 20 | 20000 | 25 | 10000 |
Location2 | 30 | 30000 | 3 | 12300 |
import pandas as pd
xl_file = read_excel("file path", skiprow=2, sheetname=0)
/* since the first two rows are always blank */
0 1 2 3 4
0 NaN NaN NaN Auto loan NaN
1 Branch Code Branch Name Region No of accounts Portfolio Outstanding
2 3000 Name1 Central 0 0
3 3001 Name2 Central 0 0
I want to convert it to Auto loan.No of account
, Auto loan.Portfolio Outstanding
as the headers.
Suppose your DataFrame is
df
:so that it looks like this:
Then first forward fill the NaNs in the first two rows (thus propagating 'Auto loan', for example).
Next fill in the remaining NaNs with empty strings:
Now join the two rows together with
.
and assign that as the column level values:And finally, remove the first two rows:
This yields
Alternatively, you could create a MultiIndex column instead of creating a flat column index:
Now
df
looks like this:the column is a MultiIndex:
The column has two levels. The first level has values
[u'Area', u'Auto loan']
, the second has values[u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region']
.You can then access a column by specifing the value from both levels:
One advantage of using a MultiIndex is that you can easily select all columns which have a certain level value. For instance, to select the sub-DataFrame having to do with
Auto loans
you could use:For more on selecting rows and columns from a MultiIndex, see MultiIndexing Using Slicers.