Suppose I have the following dataframe:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
baz one -2.616540 -1.368694
two -1.989319 1.627848
foo one -0.404563 -1.099314
two -2.006166 0.867398
qux one -0.843150 -1.045291
two 2.129620 -2.697217
I know select a sub-dataframe by indexing:
temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
However, if I look at the index, the values of the original index still appear:
temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
This does not happen with normal dataframes. If you index, the remaining copy (or even the view) contains only the selected index/columns. This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing
df.index
df
This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?
To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:
temp = s.loc[(('bar', 'foo'), slice(None)), :]
but the index still has 'baz' and 'qux' values:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
names=[u'first', u'second'])
To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:
import pandas as pd
def update_multiindex(df):
if isinstance(df.columns, pd.MultiIndex):
new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:, key].empty}
new_df = pd.DataFrame(new_df)
else:
new_df = df.copy()
if isinstance(df.index, pd.MultiIndex):
new_df = {key: new_df.loc[key, :] for key in new_df.index if not new_df.loc[key, :].empty}
new_df = pd.DataFrame(new_df).T
return new_df
temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and the original index.
That said, I suspect (but do not know) that
MultiIndex
used this way is built on top ofCategoricalIndex
, which has the method remove_unused_levels(). It may be wrapped byMultiIndex
, but I cannot tell, because...Second,
MultiIndex
is notably missing from the pandas API documentation. I do not useMultiIndex
, but you might consider looking for and/or opening a ticket on GitHub about this if you do use it regularly. Beyond that, you may have to grunnel through the source code if you want to find exact information on the features available withMultiIndex
.