Apologies in advance: this is hard for me to express well in a single question and in good English.
I use pandas with Python.
Let's say that each game (represented by an ID) has several individuals, each with their own characteristics. One of these characteristics is the group they belong to (XL, X, XS).
One important point: a game may have only one group of individuals represented.
In my descriptive statistics over all the games, the group XS scores better than X.
But I want to check whether that still holds when XS actually faces the group X in the same game.
Here is what a groupby on the dataframe gives:
DF.groupby(['ID','GROUP']).mean() #notice the only other column is the score
+---------+-------+---------------+
| ID | GROUP | MEAN OF SCORE |
+---------+-------+---------------+
| 1000046 | XS | 4.50 |
| 1000047 | XS | 6.41 |
| 1000051 | X | 3.00 |
| | XS | 3.75 |
+---------+-------+---------------+
The dataframe looks like this:
+---------+-------+-------+
| ID | GROUP | SCORE |
+---------+-------+-------+
| 1000046 | XS | 5.00 |
| 1000046 | XS | 5.00 |
| 1000046 | XS | 4.00 |
| 1000046 | XS | 4.00 |
| 1000047 | XS | 6.41 |
| 1000047 | XS | 6.41 |
| 1000047 | XS | 6.41 |
| 1000051 | X | 3.00 |
| 1000051 | X | 3.00 |
| 1000051 | X | 3.00 |
| 1000051 | XS | 3.75 |
| 1000051 | XS | 3.75 |
| 1000051 | XS | 3.75 |
+---------+-------+-------+
As you can see, XS is the only group in some games, and that biases my reading of the statistics.
So I want to select the IDs of the games that contain several groups, such as 1000051.
I had a look at the groups attribute of the groupby object. The problem is that its keys are two-value tuples, ('1000051','X') and ('1000051','XS'), so they do not tell me whether one ID (game) contains more than one group, the way something like ('1000051','X','XS') would.
I know I could write an algorithm to build a dict like the following:
Ids_groups = {
'1000046': ['XS'],
'1000047': ['XS'],
'1000051' : ['XS','X']
}
Then I can keep only the entries whose value (a list) contains 'XS' and has more than one element, and use the resulting list of keys ['1000051',...] to select the wanted rows of the dataframe.
So my question is: is there a cleverer, more efficient way to do this?
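To make the idea concrete, here is a rough sketch of that dict-based approach. It assumes the dataframe is named DF as above, rebuilds the sample rows from the table, and stores the IDs as integers rather than strings; the names ids_groups, wanted_ids and selected are only illustrative.

```python
import pandas as pd

# Sample data copied from the table above (IDs assumed to be integers).
DF = pd.DataFrame({
    "ID": [1000046] * 4 + [1000047] * 3 + [1000051] * 6,
    "GROUP": ["XS"] * 7 + ["X"] * 3 + ["XS"] * 3,
    "SCORE": [5.0, 5.0, 4.0, 4.0, 6.41, 6.41, 6.41,
              3.0, 3.0, 3.0, 3.75, 3.75, 3.75],
})

# Build {game ID: list of distinct groups seen in that game}.
ids_groups = {}
for game_id, group in zip(DF["ID"], DF["GROUP"]):
    groups = ids_groups.setdefault(game_id, [])
    if group not in groups:
        groups.append(group)

# Keep the games where XS appears together with at least one other group.
wanted_ids = [game_id for game_id, groups in ids_groups.items()
              if "XS" in groups and len(groups) > 1]

# Select the rows belonging to those games.
selected = DF[DF["ID"].isin(wanted_ids)]
print(selected)
```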
Pandas version: 0.23.4 Python version: 3.7.4
Use GroupBy.filter to keep only the games in which more than one group appears.
We can also use GroupBy.transform to build a mask and perform boolean indexing.
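A minimal sketch of both options, assuming the dataframe is named DF as in the question and rebuilding the sample rows from its table. The test "more than one distinct group per ID" is expressed here with nunique, which is one reasonable choice rather than the only one; the variable names are illustrative.

```python
import pandas as pd

# Sample data copied from the question's table.
DF = pd.DataFrame({
    "ID": [1000046] * 4 + [1000047] * 3 + [1000051] * 6,
    "GROUP": ["XS"] * 7 + ["X"] * 3 + ["XS"] * 3,
    "SCORE": [5.0, 5.0, 4.0, 4.0, 6.41, 6.41, 6.41,
              3.0, 3.0, 3.0, 3.75, 3.75, 3.75],
})

# Option 1: GroupBy.filter keeps all rows of each game for which
# the callable returns True (here: more than one distinct group).
mixed_filter = DF.groupby("ID").filter(lambda g: g["GROUP"].nunique() > 1)

# Option 2: GroupBy.transform broadcasts the per-game number of distinct
# groups back onto every row, which gives a boolean mask for indexing.
mask = DF.groupby("ID")["GROUP"].transform("nunique") > 1
mixed_transform = DF[mask]

# Per-group mean score, restricted to games where the groups actually meet.
print(mixed_transform.groupby(["ID", "GROUP"])["SCORE"].mean())
```

On larger dataframes the transform version is usually the faster of the two, since it avoids calling a Python lambda once per game. If XS must be one of the groups present, that condition can be added to the filter callable, for example with (g["GROUP"] == "XS").any().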