Pandas: Drop all string elements from a mixed-type Series of integers and strings

This drives me nuts. When I searched for tips about dropping elements in a dataframe, I found nothing about mixed-type Series.

Say here is a dataframe:

import pandas as pd
df = pd.DataFrame(data={'col1': [1,2,3,4,'apple','apple'], 'col2': [3,4,5,6,7,8]})
a = df['col1']

Then 'a' is a mixed-type Series with 6 elements. How can I remove all the 'apple' entries from a? I need the Series 1, 2, 3, 4.

There are 3 answers

SeaBean On BEST ANSWER

To keep the integers as integer type instead of having them changed to float:

Approach: keep only the rows whose values are numeric, instead of converting the non-numeric values to NaN and then dropping the NaN. The difference is that there is no intermediate result containing NaN, which would force the numeric values to change from integer to float.

a = pd.to_numeric(a[a.astype(str).str.isnumeric()])

Result:

The resulting dtype remains integer (int64):

print(a)

0    1
1    2
2    3
3    4
Name: col1, dtype: int64
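
The one-liner above can be unpacked into separate steps, which makes it easier to see what each part contributes. This is only a sketch using the question's data; the names as_text, mask and ints_only are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 'apple', 'apple'], 'col2': [3, 4, 5, 6, 7, 8]})
a = df['col1']

# Step 1: render every element as text so the .str methods apply uniformly
as_text = a.astype(str)

# Step 2: True where the text consists only of numeric characters
mask = as_text.str.isnumeric()

# Step 3: keep the matching rows, then convert the object-dtype result to a numeric dtype
ints_only = pd.to_numeric(a[mask])

print(ints_only.tolist())
print(ints_only.dtype)
```

One caveat worth keeping in mind: str.isnumeric() is False for text such as "-1" or "1.5", so negative numbers and decimals would be dropped by this filter as well.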

If you instead produce an intermediate result with NaN, as below:

a = pd.to_numeric(a, errors='coerce').dropna()

the resulting dtype is forced to change to float (instead of remaining integer):

0    1.0
1    2.0
2    3.0
3    4.0
Name: col1, dtype: float64
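
If the coerce-and-drop route is preferred anyway, the float result can be cast back once the NaN are gone. A minimal sketch, assuming every remaining value is a whole number (the names coerced and restored are illustrative only):

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 'apple', 'apple'], name='col1')

# Non-numeric entries become NaN, so the intermediate dtype is float64
coerced = pd.to_numeric(a, errors='coerce')

# With the NaN rows removed, the values can be cast back to integers
restored = coerced.dropna().astype('int64')

print(coerced.dtype)
print(restored.dtype)
```

Note that the cast truncates any fractional part, so this only makes sense when the column is known to hold whole numbers.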

Golden Lion On

You can drop rows by label, where labels is a list of the index values to remove.

df = pd.DataFrame(data={'col1': [1,2,3,4,'apple','apple'], 'col2': [3,4,5,6,7,8]})
df.reset_index(inplace=True)  # keeps the original row numbers as an 'index' column

# .str.isnumeric() gives NaN for the non-string (integer) entries and False for 'apple',
# so comparing with 0 is True only for the non-numeric strings
non_numeric = df.col1.str.isnumeric().eq(0)

labels = list(non_numeric[non_numeric].index)
if len(labels) > 0:
    df = df.drop(labels=labels, axis=0)
print(df)

Output:

   index  col1  col2
0      0     1     3
1      1     2     4
2      2     3     5
3      3     4     6

abhishekbasu On

You could use the apply method with a lambda to flag the strings, replace them with a value like NaN, and then filter them out.

import numpy as np

a = df['col1'].apply(lambda x: np.nan if isinstance(x, str) else x).dropna()

What this piece of code does is:

  • It first replaces all instances of strings in the column with NaN
  • Then drops the NaNs

This also avoids coercing a string element that happens to contain a valid int/float. For example, if the column has an element like "12", it is dropped along with the other strings rather than converted to a number, assuming you do not want such strings converted.
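
To make the difference concrete, here is a small sketch on a made-up Series (not the question's data) that contains the string "12". The names s, by_type and by_parse are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical input: an integer, a numeric-looking string, and a plain string
s = pd.Series([1, '12', 'apple'])

# Type-based filter: every str is removed, including '12'
by_type = s.apply(lambda x: np.nan if isinstance(x, str) else x).dropna()

# Parse-based filter: '12' is converted to a number and therefore kept
by_parse = pd.to_numeric(s, errors='coerce').dropna()

print(by_type.tolist())
print(by_parse.tolist())
```

Which of the two is right depends on whether numeric-looking strings should count as numbers in your data.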

Further, if you want the final output to be of int type, you could map it like so:

a = df['col1'].apply(lambda x: np.nan if isinstance(x, str) else x).dropna().map(int)