Select "entities" where only there are several groups of another characteristic (another column)

80 views Asked by At

I am sorry in advance because that's very difficult to express well with only one question in a good English for me.

I use pandas with python.

Let's say for any game (represented by an id) there are several individuals with their characteristics. One of these characteristics is to be in a group (XL, X, XS).

One important thing is, in one game we can have only one group of individuals represented.
Then in my descriptive statistics, considering all the games, the group XS is better than X. But I really want to confirm if they really are when they really face the group X in a game.

Let's see with a groupby method on the dataframe:

DF.groupby(['ID','GROUP']).mean() #notice the only other column is the score
+---------+-------+---------------+
|   ID    | GROUP | MEAN OF SCORE |
+---------+-------+---------------+
| 1000046 | XS    |          4.50 |
| 1000047 | XS    |          6.41 |
| 1000051 | X     |          3.00 |
|         | XS    |          3.75 |
+---------+-------+---------------+

The dataframe is like:

+---------+-------+-------+
|   ID    | GROUP | SCORE |
+---------+-------+-------+
| 1000046 | XS    |  5.00 |
| 1000046 | XS    |  5.00 |
| 1000046 | XS    |  4.00 |
| 1000046 | XS    |  4.00 |
| 1000047 | XS    |  6.41 |
| 1000047 | XS    |  6.41 |
| 1000047 | XS    |  6.41 |
| 1000051 | X     |  3.00 |
| 1000051 | X     |  3.00 |
| 1000051 | X     |  3.00 |
| 1000051 | XS    |  3.75 |
| 1000051 | XS    |  3.75 |
| 1000051 | XS    |  3.75 |
+---------+-------+-------+

As you can see, XS is the only category in some games and it biases my stats understanding. So I want to select the IDs of games which have several categories, as 1000051.

I had a look on the attribute groups of the groupby object, the problem is the tuples only have two values ('1000051','X),('1000051','XS'), and do not gives information if one ID (game) contains more than one group like ('1000051','X','XS').

Well, I know I can make an algorithm in order to obtain a dict like the following:

Ids_groups = {
 '1000046': ['XS'],
 '1000047': ['XS'],
 '1000051' : ['XS','X']
}

Then I can keep only the keys and values where 'XS' is in values (a list) and the length of the values is over 1. Then use the list of keys ['1000051',...] to select wanted rows of the dataframe.

So I ask you if there is a more clever way to do it, a more efficient one too.

Pandas version: 0.23.4 Python version: 3.7.4

2

There are 2 answers

2
ansev On BEST ANSWER

Use GroupBy.filter:

df_filtered=df.groupby('ID').filter(lambda x: x.GROUP.nunique()>1)
print(df_filtered)

         ID GROUP  SCORE
7   1000051     X   3.00
8   1000051     X   3.00
9   1000051     X   3.00
10  1000051    XS   3.75
11  1000051    XS   3.75
12  1000051    XS   3.75

Also we can use Groupby.transform to performance a boolean indexing:

df_filtered=df[df.groupby('ID').GROUP.transform('nunique')>1]

print(df)

          ID GROUP  SCORE
0   1000046    XS   5.00
1   1000046    XS   5.00
2   1000046    XS   4.00
3   1000046    XS   4.00
4   1000047    XS   6.41
5   1000047    XS   6.41
6   1000047    XS   6.41
7   1000051     X   3.00
8   1000051     X   3.00
9   1000051     X   3.00
10  1000051    XS   3.75
11  1000051    XS   3.75
12  1000051    XS   3.75
0
Celius Stingher On

I would use nunique() in order to know how many groups are there per game, afterwards you can use it within a filter function, or simply as an extra variable to perform your calculations:

import pandas as pd
d = {'ID':[1,1,1,1,2,2,2,3,3],'GROUP':['XS','XS','X','XS','XS','XS','XS','XS','X'],'SCORE':[3,4,2,3,4,3,3,2,1]}
df = pd.DataFrame(d)

df_1 = df.groupby(['ID'],as_index=False).agg({'GROUP':'nunique'})
print(df_1)

Output:

   ID  GROUP
0   1      2
1   2      1
2   3      2

Therefore you know now that for game 1, there were 2 groups participating, for game 2 only 1 and for game 3 there were 2 groups as well.