How to identify records in a DataFrame (Python/Pandas) that contains leading or trailing spaces

497 views Asked by At

I would like to know how to write a formula that would identify/display records of string/object data type on a Pandas DataFrame that contains leading or trailing spaces.

The purpose for this is to get an audit on a Jupyter notebook of such records before applying any strip functions.

The goal is for the script to identify these records automatically without having to type the name of the columns manually. The scope should be any column of str/object data type that contains a value that includes either a leading or trailing spaces or both.

Please notice. I would like to see the resulting output in a dataframe format.

Thank you!

Link to sample dataframe data

1

There are 1 answers

7
mozway On

You can use:

df['col'].str.startswith(' ')
df['col'].str.endswith(' ')

or with a regex:

df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')

Example:

df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})

df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['stop']

     col  start    end  either
0    abc   True  False    True
1    def  False  False   False
2   ghi   False   True    True
3   jkl    True   True    True

However, this is likely not faster than directly stripping the spaces:

df['col'] = df['col'].str.strip()

   col
0  abc
1  def
2  ghi
3  jkl
updated answer

To detect the columns with leading/traiing spaces, you can use:

cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index

example on the provided link:

Index(['First Name', 'Team'], dtype='object')