Lambda with Apply and Groupby


I am trying to compute the unique values of one column of a pandas dataframe, grouped by a second column, and return the result as a new column of the dataframe.

When I test this operation on the following dataframe it returns null values.

df = pd.DataFrame(
    [('bird', 'Falconiformes', 389.0),
     ('bird', 'Psittaciformes', 24.0),
     ('mammal', 'Carnivora', 80.2),
     ('mammal', 'Primates', np.nan),
     ('mammal', 'Carnivora', 58)],
    index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
    columns=('class', 'order', 'max_speed'))

In pandas 0.18 I was using

df['test'] = df.groupby('class').transform('unique')
Traceback (most recent call last):

  File "<ipython-input-146-283294ac8bef>", line 1, in <module>
    df['test'] = df.groupby('class').transform('unique')

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py", line 1469, in transform
    raise ValueError(msg)

ValueError: 'unique' is not a valid function name for transform(name)

but my admin recently updated pandas, and 'unique' is no longer a valid function for transform. There is a thread suggesting apply for pandas 1.1.3 (see discussion). I looked at the new documentation for 1.1.3 and tried the following

df['test']=df.groupby('class').apply(lambda x: x['max_speed'].unique())
df
Out[135]: 
     index   class           order  max_speed test
0   falcon    bird   Falconiformes      389.0  NaN
1   parrot    bird  Psittaciformes       24.0  NaN
2     lion  mammal       Carnivora       80.2  NaN
3   monkey  mammal        Primates        NaN  NaN
4  leopard  mammal       Carnivora       58.0  NaN

but apply isn't expanding the value to the other rows, even though the groupby itself returns the expected arrays:

df.groupby('class').apply(lambda x: x['max_speed'].unique())
Out[140]: 
class
bird          [389.0, 24.0]
mammal    [80.2, nan, 58.0]
dtype: object
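A note on why the assignment gives NaN: the apply result is indexed by 'class' ('bird', 'mammal'), not by the dataframe's row labels, so nothing aligns when it is assigned to df['test']. As a sketch (my suggestion, not from the thread), mapping the 'class' column through that per-group Series restores the alignment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [('bird', 'Falconiformes', 389.0),
     ('bird', 'Psittaciformes', 24.0),
     ('mammal', 'Carnivora', 80.2),
     ('mammal', 'Primates', np.nan),
     ('mammal', 'Carnivora', 58)],
    index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
    columns=('class', 'order', 'max_speed'))

# groupby(...)['max_speed'].unique() yields a Series indexed by 'class';
# Series.map looks up each row's class in that Series, so every row
# receives its group's array of unique speeds.
uniques = df.groupby('class')['max_speed'].unique()
df['test'] = df['class'].map(uniques)
```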

If I try to add the keyword arguments mentioned in the latest apply documentation, I get an error message.

df['test']=df.groupby('class').apply(lambda x: x['max_speed'].unique(), result_type='expand')
Traceback (most recent call last):
 File "<ipython-input-145-9b84754c6daf>", line 1, in <module>
    df['test']=df.groupby('class').apply(lambda x: x['max_speed'].unique(), result_type='expand')

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 870, in apply
    return self._python_apply_general(f, self._selected_obj)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 892, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\ops.py", line 213, in apply
    res = f(group)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 843, in f
    return func(g, *args, **kwargs)

TypeError: <lambda>() got an unexpected keyword argument 'result_type'

I know that I can use groupby with aggregate and the unique function and then merge the resulting dataframe back in, but I have to do this operation for several different groupings and I would prefer a one-line answer.
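For reference, a sketch of that aggregate-and-merge route (the column name 'test' is my choice; the reset_index/set_index round trip preserves the animal labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [('bird', 'Falconiformes', 389.0),
     ('bird', 'Psittaciformes', 24.0),
     ('mammal', 'Carnivora', 80.2),
     ('mammal', 'Primates', np.nan),
     ('mammal', 'Carnivora', 58)],
    index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
    columns=('class', 'order', 'max_speed'))

# aggregate the unique speeds per class into their own frame,
# then merge back on 'class'; a left merge keeps the row order
agg = df.groupby('class')['max_speed'].unique().rename('test').reset_index()
df = df.reset_index().merge(agg, on='class', how='left').set_index('index')
```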

1 Answer

piterbarg (best answer)

This is a little hacky, but I think it gets you what you want:

df.groupby('class').apply(lambda d: d.assign(Test = [d['max_speed'].unique()]*len(d)))

produces

|                       | class   | order          |   max_speed | Test             |
|:----------------------|:--------|:---------------|------------:|:-----------------|
| ('bird', 'falcon')    | bird    | Falconiformes  |       389   | [389.  24.]      |
| ('bird', 'parrot')    | bird    | Psittaciformes |        24   | [389.  24.]      |
| ('mammal', 'lion')    | mammal  | Carnivora      |        80.2 | [80.2  nan 58. ] |
| ('mammal', 'monkey')  | mammal  | Primates       |       nan   | [80.2  nan 58. ] |
| ('mammal', 'leopard') | mammal  | Carnivora      |        58   | [80.2  nan 58. ] |

The trick is to convince assign that d['max_speed'].unique() should be replicated across all relevant rows: for that, we pass a list of length len(d) that holds the same element d['max_speed'].unique() in every entry. Here d is each sub-dataframe in the groupby.
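One possible refinement (my addition, not part of the original answer): the result above carries a MultiIndex like ('bird', 'falcon') because apply prepends the group label. Passing group_keys=False to groupby should suppress that prefix, so the result keeps the original index and can be assigned straight back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [('bird', 'Falconiformes', 389.0),
     ('bird', 'Psittaciformes', 24.0),
     ('mammal', 'Carnivora', 80.2),
     ('mammal', 'Primates', np.nan),
     ('mammal', 'Carnivora', 58)],
    index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
    columns=('class', 'order', 'max_speed'))

# group_keys=False stops apply from prepending the group label,
# so the returned frame keeps the original animal index
out = df.groupby('class', group_keys=False).apply(
    lambda d: d.assign(Test=[d['max_speed'].unique()] * len(d)))
```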