Updating Pandas MultiIndex after indexing the dataframe

2k views Asked by At

Suppose I have the following dataframe:

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
       ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s

                     0         1
first second                    
bar   one    -0.012581  1.421286
      two    -0.048482 -0.153656
baz   one    -2.616540 -1.368694
      two    -1.989319  1.627848
foo   one    -0.404563 -1.099314
      two    -2.006166  0.867398
qux   one    -0.843150 -1.045291
      two     2.129620 -2.697217

I know select a sub-dataframe by indexing:

temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp

                     0         1
first second                    
bar   one    -0.012581  1.421286
      two    -0.048482 -0.153656

However, if I look at the index, the values of the original index still appear:

temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
       labels=[[0, 0], [0, 1]],
       names=[u'first', u'second'])

This does not happen with normal dataframes. If you index, the remaining copy (or even the view) contains only the selected index/columns. This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing

df.index
df

This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?

To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:

 temp = s.loc[(('bar', 'foo'), slice(None)), :]

but the index still has 'baz' and 'qux' values:

MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
       labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
       names=[u'first', u'second'])

To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:

import pandas as pd
def update_multiindex(df):
    if isinstance(df.columns, pd.MultiIndex):
        new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:,     key].empty}    
        new_df = pd.DataFrame(new_df)
    else:
        new_df = df.copy()
    if isinstance(df.index, pd.MultiIndex):
        new_df = {key: new_df.loc[key, :] for key in new_df.index if not     new_df.loc[key, :].empty}
        new_df = pd.DataFrame(new_df).T
    return new_df

temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
       labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
4

There are 4 answers

0
Andreus On

Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and the original index.

That said, I suspect (but do not know) that MultiIndex used this way is built on top of CategoricalIndex, which has the method remove_unused_levels(). It may be wrapped by MultiIndex, but I cannot tell, because...

Second, MultiIndex is notably missing from the pandas API documentation. I do not use MultiIndex, but you might consider looking for and/or opening a ticket on GitHub about this if you do use it regularly. Beyond that, you may have to grunnel through the source code if you want to find exact information on the features available with MultiIndex.

8
Alexander On

Try using droplevel.

temp.index = temp.index.droplevel()

>>> temp
               0         1
second                    
one     0.450819 -1.071271
two    -0.371563  0.411808

>>> temp.index
Index([u'one', u'two'], dtype='object')

When dealing with columns, it's the same thing:

df.columns = df.columns.droplevel()

You can also use xs and set the drop_level parameter to True (default value is False):

>>> s.xs('bar', drop_level=True) 
               0         1
second                    
one     0.450819 -1.071271
two    -0.371563  0.411808
2
MangoHands On

There is a difference between the index of s and the index of temp:

In [25]: s.index
Out[25]: 
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [26]: temp.index
Out[26]: 
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0], [0, 1]],
           names=[u'first', u'second'])

Notices that the labels in the MultiIndex are different.

0
Ezekiel Kruglick On

If I understand correctly your usage pattern you may be able to get the best of both worlds. I'm focusing on:

This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing

df.index df

This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?

Consideration (1) is that you want to know the index of what's left. Consideration (2) is that as mentioned above, if you trim the multiindex you can't merge any data back into your original, and also its a bunch of nonobvious steps that aren't really encouraged.

The underlying fundamental is that index does NOT return updated contents for a multiindex if any rows or columns have been deleted and this is not considered a bug because that's not the approved use of MultiIndexes (read more: github.com/pydata/pandas/issues/3686). The valid API access for the current contents of a MultiIndex is get_level_values.

So would it fit your needs to adjust your practice to use this?

df.index.get_level_values(-put your level name or number here-)

For Multiindexes this is the approved API access technique and there are some good reasons for this. If you use get_level_values instead of just .index you'll be able to get the current contents while ALSO preserving all the information in case you want to re-merge modified data or otherwise match against the original indices for comparisons, grouping, etc...

Does that fit your needs?