Pandas data frame use interpolate() partitioning with specific columns

51 views Asked by At

I have the following Pandas data frame (called df).

+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John   | Dogs   | 2000 | 2      |
| John   | Dogs   | 2001 | 2      |
| John   | Dogs   | 2002 | 2      |
| John   | Dogs   | 2003 | 2      |
| John   | Dogs   | 2004 | 2      |
| John   | Dogs   | 2005 | 2      |
| John   | Cats   | 2000 | 1      |
| John   | Cats   | 2001 | NaN    |
| John   | Cats   | 2002 | NaN    |
| John   | Cats   | 2003 | 4      |
| John   | Cats   | 2004 | 5      |
| John   | Cats   | 2005 | 6      |
| Peter  | Dogs   | 2000 | NaN    |
| Peter  | Dogs   | 2001 | 1      |
| Peter  | Dogs   | 2002 | NaN    |
| Peter  | Dogs   | 2003 | 5      |
| Peter  | Dogs   | 2004 | 5      |
| Peter  | Dogs   | 2005 | 5      |
| Peter  | Cats   | 2000 | NaN    |
| Peter  | Cats   | 2001 | 4      |
| Peter  | Cats   | 2002 | 4      |
| Peter  | Cats   | 2003 | 4      |
| Peter  | Cats   | 2004 | 4      |
| Peter  | Cats   | 2005 | 4      |
+--------+--------+------+--------+

My target is to get the following, which means using the interpolate method to fill the NaN values, but based on the other column value. In other words, it should

  1. partition the df using the Person and Animal columns
  2. order by Year (asc)
  3. apply the interpolate method

.

+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John   | Dogs   | 2000 | 2      |
| John   | Dogs   | 2001 | 2      |
| John   | Dogs   | 2002 | 2      |
| John   | Dogs   | 2003 | 2      |
| John   | Dogs   | 2004 | 2      |
| John   | Dogs   | 2005 | 2      |
| John   | Cats   | 2000 | 1      |
| John   | Cats   | 2001 | 2      |
| John   | Cats   | 2002 | 3      |
| John   | Cats   | 2003 | 4      |
| John   | Cats   | 2004 | 5      |
| John   | Cats   | 2005 | 6      |
| Peter  | Dogs   | 2000 | NaN    |
| Peter  | Dogs   | 2001 | 1      |
| Peter  | Dogs   | 2002 | 3      |
| Peter  | Dogs   | 2003 | 5      |
| Peter  | Dogs   | 2004 | 5      |
| Peter  | Dogs   | 2005 | 5      |
| Peter  | Cats   | 2000 | NaN    |
| Peter  | Cats   | 2001 | 4      |
| Peter  | Cats   | 2002 | 4      |
| Peter  | Cats   | 2003 | 4      |
| Peter  | Cats   | 2004 | 4      |
| Peter  | Cats   | 2005 | 4      |
+--------+--------+------+--------+

What I have done

I can filter for each Person and each Animal and then apply the interpolate methods. Finally, merge all together, but this sounds dull and long if you have many columns.

2

There are 2 answers

0
Corralien On BEST ANSWER

You can try:

df['Number'] = (df.sort_values('Year', ascending=True)
                  .groupby(['Person', 'Animal'])['Number']
                  .transform(lambda x: x.interpolate()))
print(df)

# Output
   Person Animal  Year  Number
0    John   Dogs  2000     2.0
1    John   Dogs  2001     2.0
2    John   Dogs  2002     2.0
3    John   Dogs  2003     2.0
4    John   Dogs  2004     2.0
5    John   Dogs  2005     2.0
6    John   Cats  2000     1.0
7    John   Cats  2001     2.0  # interpolate
8    John   Cats  2002     3.0  # interpolate
9    John   Cats  2003     4.0
10   John   Cats  2004     5.0
11   John   Cats  2005     6.0
12  Peter   Dogs  2000     NaN
13  Peter   Dogs  2001     1.0
14  Peter   Dogs  2002     3.0
15  Peter   Dogs  2003     5.0
16  Peter   Dogs  2004     5.0
17  Peter   Dogs  2005     5.0
18  Peter   Cats  2000     NaN
19  Peter   Cats  2001     4.0
20  Peter   Cats  2002     4.0
21  Peter   Cats  2003     4.0
22  Peter   Cats  2004     4.0
23  Peter   Cats  2005     4.0

For multiple columns, just use the same operation:

cols = ['Number']  # list of columns
df[cols] = (df.sort_values('Year', ascending=True)
              .groupby(['Person', 'Animal'])[cols]
              .transform(lambda x: x.interpolate()))
0
Laurent B. On

Proposed script

Another readable solution :

def interpolate_values(group):
    group = group.sort_values('Year')
    group['Number'] = group['Number'].interpolate()
    return group

df['Number'] = (df.groupby(['Person', 'Animal'], group_keys=False)
                  .apply(interpolate_values)['Number'])
   Person Animal  Year  Number
0    John   Dogs  2000     2.0
1    John   Dogs  2001     2.0
2    John   Dogs  2002     2.0
3    John   Dogs  2003     2.0
4    John   Dogs  2004     2.0
5    John   Dogs  2005     2.0
6    John   Cats  2000     1.0
7    John   Cats  2001     2.0
8    John   Cats  2002     3.0
9    John   Cats  2003     4.0
10   John   Cats  2004     5.0
11   John   Cats  2005     6.0
12  Peter   Dogs  2000     NaN
13  Peter   Dogs  2001     1.0
14  Peter   Dogs  2002     3.0
15  Peter   Dogs  2003     5.0
16  Peter   Dogs  2004     5.0
17  Peter   Dogs  2005     5.0
18  Peter   Cats  2000     NaN
19  Peter   Cats  2001     4.0
20  Peter   Cats  2002     4.0
21  Peter   Cats  2003     4.0
22  Peter   Cats  2004     4.0
23  Peter   Cats  2005     4.0