Pandas Assign, Lambda, List Comprehension Question


I receive data as a list of dicts in a single column. Each list can be a different length. Sample data looks like this:

import pandas as pd

df = pd.DataFrame(
    [
        [[{'value': 1}, {'value': 2}, {'value': 3}]],
        [[{'value': 4}, {'value': 5}]]
    ],
    columns=['data'],
)

df
                                          data
0   [{'value': 1}, {'value': 2}, {'value': 3}]
1   [{'value': 4}, {'value': 5}]

I want to create a new column min_val which contains the minimum value for each row. I'm trying this:

df.assign(min_val=lambda row: min(val['value'] for val in row.data))

But I get the error:

TypeError: list indices must be integers or slices, not str

A very similar lambda/comprehension combination works in Dask Bag but not in raw Pandas, which is very confusing.

Any help would be very much appreciated.

There are 3 answers

Nick (BEST ANSWER)

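When you pass a callable to assign, it is called once with the entire DataFrame, not once per row. In your code, row.data is therefore the whole column, and each val is a list rather than a dict. Apply your function to each element of the data Series instead:

df = df.assign(min_val=df.data.apply(lambda r: min(v['value'] for v in r)))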

Output:

                                         data  min_val
0  [{'value': 1}, {'value': 2}, {'value': 3}]        1
1                [{'value': 4}, {'value': 5}]        4
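
As a minimal sketch of the point above, the snippet below records what the callable passed to assign is actually handed. The names row_min and seen are hypothetical helpers introduced only for this illustration; the sample data is the one from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [[{'value': 1}, {'value': 2}, {'value': 3}]],
        [[{'value': 4}, {'value': 5}]],
    ],
    columns=['data'],
)

seen = []  # hypothetical log of the argument types the callable receives


def row_min(frame):
    # `frame` is the whole DataFrame, so the per-row work must be
    # done explicitly, here with Series.apply on the `data` column.
    seen.append(type(frame).__name__)
    return frame['data'].apply(lambda r: min(v['value'] for v in r))


out = df.assign(min_val=row_min)
print(seen)                      # the callable ran once, on a DataFrame
print(out['min_val'].tolist())   # per-row minimums
```

Keeping the callable form can be convenient in a method chain, because it refers to the DataFrame at that point in the chain rather than to the outer df variable.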
Diganto Bhowmik
df['min_val'] = df['data'].apply(lambda x: min(item['value'] for item in x))
Timeless

That's because the lambda passed to assign receives the entire DataFrame, so your generator expression iterates through the column "data" as a whole and not through the dicts of each list/row.

# 1st iteration
# `val` is equal to [{'value': 1}, {'value': 2}, {'value': 3}]
# thus, val["value"] indexes a list with a str and triggers the TypeError

# 2nd iteration (never reached, because the error is raised first)
# `val` would be equal to [{'value': 4}, {'value': 5}]
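
The first iteration described above can be reproduced in isolation. This is only a sketch using the question's sample data; the variable names first and message are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [[{'value': 1}, {'value': 2}, {'value': 3}]],
        [[{'value': 4}, {'value': 5}]],
    ],
    columns=['data'],
)

# Iterating over the column yields one whole list per row,
# which is what `val` is bound to in the original lambda.
first = next(iter(df['data']))
print(type(first).__name__)

message = None
try:
    first['value']  # indexing a list with a str
except TypeError as exc:
    message = str(exc)
print(message)
```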

To fix that, one option is to add another loop so that you can reach the keys/values of each dict:

out = df.assign(min_val=[min(k["value"] for k in d) for d in df["data"]])

# 167 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Output:

print(out)

                                         data  min_val
0  [{'value': 1}, {'value': 2}, {'value': 3}]        1
1                [{'value': 4}, {'value': 5}]        4