Identify specific integers in column of mixed ints and strings

I have a column in a pandas df named specialty that looks like this:

0         1,5
1           1
2     1,2,4,6    
3           2
4           1
5         1,5
6           3
7           3
8           1
9         2,3

I'd like to create a new column called is_1 that contains a 1 for all rows in specialty that contain a 1 and a 0 for rows that don't contain a 1. The output would look like this:

0       1
1       1
2       1
3       0
4       1
5       1
6       0
7       0
8       1
9       0

I'm not sure how to do this with a column of mixed dtypes. Would I just use np.where() with a str.contains() call? Like so:

np.where((part_chars['specialty'] == 1) | part_chars['specialty'].str.contains('1'), 1, 0)

Yep that works...

There are 2 answers

0
Corralien On

Update: your code works well for me.

>>> np.where((part_chars['specialty'] == 1) | part_chars['specialty'].str.contains('1'), 1, 0)
array([1, 1, 1, 0, 1, 1, 0, 0, 1, 0])

If you have mixed dtype, you can force the dtype with .astype(str):

>>> np.where(part_chars['specialty'].astype(str).str.contains('1'), 1, 0)
array([1, 1, 1, 0, 1, 1, 0, 0, 1, 0])
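
To illustrate why the cast helps, here is a minimal sketch. It assumes a hypothetical column named mixed that really does hold a mix of Python ints and strings (the values are made up for illustration). On such a column, .str.contains returns a missing value for every non-string row, while .astype(str) first turns every row into a string:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype column: some rows are ints, others are strings.
mixed = pd.Series(['1,5', 1, '1,2,4,6', 2, '2,3'], dtype=object)

# Without the cast, the non-string rows come back as missing values.
raw = mixed.str.contains('1')

# With the cast, every row is a string, so the mask is purely boolean.
flags = np.where(mixed.astype(str).str.contains('1'), 1, 0)

print(raw.isna().tolist())
print(flags.tolist())
```

Casting once up front also removes the need for the separate == 1 comparison.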

You can use str.contains with a word-boundary regex, so that only a standalone 1 matches:

part_chars['is_1'] = (part_chars['specialty'].astype(str)
                          .str.contains(r'\b1\b').astype(int))
print(part_chars)

# Output
  specialty  is_1
0       1,5     1
1         1     1
2   1,2,4,6     1
3         2     0
4         1     1
5       1,5     1
6         3     0
7         3     0
8         1     1
9       2,3     0

Alternative with str.split:

part_chars['is_1'] = (part_chars['specialty'].astype(str)  # cast first so int rows are split too
                          .str.split(',', expand=True)
                          .eq('1').any(axis=1).astype(int))
print(part_chars)

# Output
  specialty  is_1
0       1,5     1
1         1     1
2   1,2,4,6     1
3         2     0
4         1     1
5       1,5     1
6         3     0
7         3     0
8         1     1
9       2,3     0
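
The split-based approach may be easier to follow step by step. The sketch below uses a shortened, hypothetical Series named specialty (not the full column from the question) and names each intermediate result, which the one-liner above does not do:

```python
import pandas as pd

# Shortened sample for illustration only.
specialty = pd.Series(['1,5', '1', '1,2,4,6', '2', '2,3'])

# One column per token position; shorter rows are padded with missing values.
tokens = specialty.astype(str).str.split(',', expand=True)

# True only where a token is exactly the string '1'; padding compares as False.
matches = tokens.eq('1')

# A row is flagged if any of its tokens matched.
is_1 = matches.any(axis=1).astype(int)

print(tokens.shape)
print(is_1.tolist())
```

Because the comparison is made on whole tokens, a value such as 21 would not be flagged by this approach.
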
4
mozway On

Use str.contains with a regex that matches full words equal to 1:

part_chars['is_1'] = (part_chars['specialty'].astype(str)
                      .str.contains(r'\b1\b').astype(int)
                     )

Output:

  specialty  is_1
0       1,5     1
1         1     1
2   1,2,4,6     1
3         2     0
4         1     1
5       1,5     1
6         3     0
7         3     0
8         1     1
9       2,3     0

Note that your solution matches any value that contains the character 1, including 21:

import numpy as np
import pandas as pd

part_chars = pd.DataFrame({'specialty': ['1,5', '1', '1,2,4,6', '2', '1', '1,5', '3', '3', '1', '2,3', '21']})
part_chars['is_1'] = np.where((part_chars['specialty'] == 1) | part_chars['specialty'].str.contains('1'), 1, 0)

Output:

   specialty  is_1
0        1,5     1
1          1     1
2    1,2,4,6     1
3          2     0
4          1     1
5        1,5     1
6          3     0
7          3     0
8          1     1
9        2,3     0
10        21     1  # might be unwanted
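
For comparison, here is a minimal sketch that runs both patterns on the same extended data. The column name substring is hypothetical and only there to put the two results side by side:

```python
import pandas as pd

part_chars = pd.DataFrame({'specialty': ['1,5', '1', '1,2,4,6', '2', '1',
                                         '1,5', '3', '3', '1', '2,3', '21']})

s = part_chars['specialty'].astype(str)

# Plain substring match: flags every value that contains the character '1'.
part_chars['substring'] = s.str.contains('1').astype(int)

# Word-boundary match: flags only a standalone 1, so '21' is not flagged.
part_chars['is_1'] = s.str.contains(r'\b1\b').astype(int)

print(part_chars)
```

The two columns should differ only on the last row, where 21 contains the digit 1 but is not the value 1.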