Completely remove one index label from a multiindex, in a dataframe


Given I have this multiindexed dataframe:

>>> import pandas as p 
>>> import numpy as np
... 
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
...          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
... 
>>> s = p.Series(np.random.randn(6), index=arrays)
>>> s
bar  one   -1.046752
     two    2.035839
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64

How should I eliminate the index label bar?
I tried with drop

>>> s1 = s.drop('bar')
>>> s1
baz  one    1.192775
     two    1.774266
foo  one   -1.716643
     two    1.158605
dtype: float64

Seems OK but bar is still there in some bizarre way:

>>> s1.index
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
           labels=[[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1['bar']
Series([], dtype: float64)
>>> 

How can I get rid of any residue of this index label?


There are 2 answers

2
Alex Huszagh On

Definitely looks like a bug.

s1.index.tolist() returns the expected value, without "bar".

>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]

s1["bar"] returns an empty Series.

>>> s1["bar"]
Series([], dtype: float64)

The standard methods to override this don't seem to work either:

>>> del s1["bar"] 
>>> s1["bar"]
Series([], dtype: float64)
>>> s1.__delitem__("bar")
>>> s1["bar"]
Series([], dtype: float64)

However, as expected, trying to access a key that never existed raises a KeyError:

>>> s1["booz"]
... KeyError: 'booz'

The reason becomes clear when you look at how the two are implemented in the source code, in pandas.core.index.py:

class MultiIndex(Index):
    ...

    def _get_levels(self):
        return self._levels

    ...

    def _get_labels(self):
        return self._labels

    # ops compat
    def tolist(self):
        """
        return a list of the Index values
        """
        return list(self.values)

So index.tolist() and _labels aren't reading the same piece of information; in fact, they aren't even close.

So, we can use this to manually update the resulting indexer.

>>> s1.index.labels
FrozenList([[1, 1, 2, 2], [0, 1, 0, 1]])
>>> s1.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])
>>> s1.index.values
array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)

If we compare this to the initial multiindexed index, we get

>>> s.index.labels
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
>>> s.index._levels
FrozenList([[u'bar', u'baz', u'foo'], [u'one', u'two']])

So the _levels attribute isn't updated, while values is.
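To make the comparison above concrete, here is a minimal sketch of the three views of the index side by side. It assumes a recent pandas, where the attribute shown above as labels is named codes; the variable names (stale_levels, first_codes, present) are only illustrative.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.arange(6.0), index=arrays)
s1 = s.drop('bar')

# levels: the unique values each level may take; drop() does not prune them
stale_levels = list(s1.index.levels[0])

# codes: one integer per row, pointing into the corresponding level
first_codes = list(s1.index.codes[0])

# the tuples actually present, which is what tolist() reports
present = s1.index.tolist()
```

Only the codes and the resulting tuples change after the drop; the first level still lists 'bar', which is the residue the question is about.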

EDIT: Overriding it wasn't as easy as I thought.

EDIT: Wrote a custom function to fix this behavior

from pandas.core.base import FrozenList, FrozenNDArray

def drop(series, level, index_name):
    # make new tmp series
    new_series = series.drop(index_name)
    # grab all indexing labels, levels, attributes
    levels = new_series.index.levels
    labels = new_series.index.labels
    index_pos = levels[level].tolist().index(index_name)
    # now need to reset the actual levels
    level_names = levels[level]
    # has no __delitem__, so... need to remake
    tmp_names = FrozenList([i for i in level_names if i != index_name])
    levels = FrozenList([j if i != level else tmp_names
                         for i, j in enumerate(levels)])
    # need to turn off validation
    new_series.index.set_levels(levels, verify_integrity=False, inplace=True)
    # reset the labels
    level_labels = labels[level].tolist()
    tmp_labels = FrozenNDArray([i-1 if i > index_pos else i
                                for i in level_labels])
    labels = FrozenList([j if i != level else tmp_labels
                         for i, j in enumerate(labels)])
    new_series.index.set_labels(labels, verify_integrity=False, inplace=True)
    return new_series

Example usage:

>>> s1 = drop(s, 0, "bar")
>>> s1.index
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> s1.index.tolist()
[('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')]
>>> s1["bar"]
...
KeyError: 'bar'
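For readers on a newer pandas, the same cleanup can probably be done without a custom function. The sketch below assumes pandas 0.20 or later, where MultiIndex.remove_unused_levels() is available, and a version where labels has been renamed to codes; levels_after and codes_after are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
s = pd.Series(np.arange(6.0), index=arrays)

s1 = s.drop('bar')

# remove_unused_levels() returns a new MultiIndex, so assign it back
s1.index = s1.index.remove_unused_levels()

levels_after = [list(level) for level in s1.index.levels]
codes_after = [list(codes) for codes in s1.index.codes]
```

The levels and codes end up matching the output of the custom drop function above, with 'bar' gone from the first level and the codes renumbered accordingly.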

EDIT: This seems to be specific to dataframes/series with multiindexing, as the standard pandas.core.index.Index class does not have the same limitations. I would recommend filing a bug report.

Consider the same series with a standard index:

>>> s = p.Series(np.random.randn(6))
>>> s.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
>>> s.drop(0, inplace=True)
>>> s.index
Int64Index([1, 2, 3, 4, 5], dtype='int64')

The same is true for a dataframe

>>> df = p.DataFrame([np.random.randn(6), np.random.randn(6)])
>>> df.index
Int64Index([0, 1], dtype='int64')
>>> df.drop(0, inplace=True)
>>> df.index
Int64Index([1], dtype='int64')
0
Jeff On

See long discussion here.

Bottom line: it's not obvious when to recompute the levels, because the operation a user is performing is unknown (think of it from the Index's perspective). For example, say you drop a value and then add a value to a level (e.g. via indexing). Recomputing the levels each time would be very wasteful and somewhat compute-intensive.

In [11]: s1.index
Out[11]: 
MultiIndex(levels=[[u'bar', u'baz', u'foo'], [u'one', u'two']],
           labels=[[1, 1, 2, 2], [0, 1, 0, 1]])

Here is the actual index itself.

In [12]: s1.index.values
Out[12]: array([('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two')], dtype=object)

In [13]: s1.index.get_level_values(0)
Out[13]: Index([u'baz', u'baz', u'foo', u'foo'], dtype='object')

In [14]: s1.index.get_level_values(1)
Out[14]: Index([u'one', u'two', u'one', u'two'], dtype='object')

If you really feel it is necessary to 'get rid' of the removed level, then simply recreate the index. However, leaving it in place is not harmful at all. These factorizations (e.g. the labels) are hidden from the user (yes, they are displayed, but to be honest that is more of a confusing pain point, hence this question).

In [15]: pd.MultiIndex.from_tuples(s1.index.values)
Out[15]: 
MultiIndex(levels=[[u'baz', u'foo'], [u'one', u'two']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
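Since the question title mentions a dataframe, here is a hedged sketch of the same recreate-the-index approach applied to a DataFrame, with the rebuilt index assigned back. The frame, its column name 'x', and the level names 'first' and 'second' are made up for illustration. Level names are passed explicitly because plain tuples do not carry them.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz', 'foo'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame({'x': np.arange(6.0)}, index=index)

df1 = df.drop('bar')

# Rebuild the index from the rows that are actually present and assign it back;
# from_tuples re-factorizes, so the dropped label no longer appears in the levels
df1.index = pd.MultiIndex.from_tuples(df1.index.tolist(),
                                      names=list(df1.index.names))

rebuilt_levels = [list(level) for level in df1.index.levels]
```

After the assignment, rebuilt_levels holds only the values still present in each level, and the level names are preserved.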