How to make regex code apply only to empty target cells

An example of my data

StreetAddress    City     State  Zip
1 Main St 01123  Winsted  CT
1 Main St        Winsted  CT     01123

I am trying to use regex and pandas to clean a spreadsheet that I have. The problem I am running into is that my regex code is replacing every cell in the entire column even if there is valid data in it.

I tried

df['Zip'] = df['StreetAddress'].str.extract(r'(\d{5})')
df['StreetAddress'] = df['StreetAddress'].str.replace(r'(\d{5})', '', regex=True)

which gives me

StreetAddress  City     State  Zip
1 Main St      Winsted  CT     01123
1 Main St      Winsted  CT

I was hoping for something more like this

StreetAddress  City     State  Zip
1 Main St      Winsted  CT     01123
1 Main St      Winsted  CT     01123

Answer by mozway (best answer)

I would use a boolean mask. This avoids overwriting existing data and is also more efficient, since the replacement is only applied to the rows where a ZIP code was found in StreetAddress:

add = df['StreetAddress'].str.extract(r'(\d{5})', expand=False)
m = add.notna()
df.loc[m, 'Zip'] = add[m]
df.loc[m, 'StreetAddress'] = (df.loc[m, 'StreetAddress']
                                .str.replace(r' *\d{5}', '', regex=True)
                             )
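
As a self-contained sketch of the mask approach, the snippet below rebuilds the two sample rows from the question and applies the same steps. It assumes the empty Zip cell is NaN/None rather than an empty string; the names `found` and `mask` are only illustrative.

```python
import pandas as pd

# Sample data modelled on the question; the missing Zip is assumed to be None/NaN.
df = pd.DataFrame({
    "StreetAddress": ["1 Main St 01123", "1 Main St"],
    "City": ["Winsted", "Winsted"],
    "State": ["CT", "CT"],
    "Zip": [None, "01123"],
})

# Extract a 5-digit run from each address; rows without one yield NaN.
found = df["StreetAddress"].str.extract(r"(\d{5})", expand=False)
mask = found.notna()

# Write only to the rows where a ZIP was actually found in the address.
df.loc[mask, "Zip"] = found[mask]
df.loc[mask, "StreetAddress"] = (
    df.loc[mask, "StreetAddress"].str.replace(r" *\d{5}", "", regex=True)
)

print(df)
```

If a row could hold a ZIP in both columns and the existing Zip should win, one option is to narrow the mask to `found.notna() & df["Zip"].isna()`.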

Alternatively, to fill Zip in one step (the extracted value takes precedence over any existing Zip):

df['Zip'] = df['StreetAddress'].str.extract(r'(\d{5})', expand=False).fillna(df['Zip'])

Or, as suggested by @ouroboros1, to keep the original Zip:

df['Zip'] = df['Zip'].fillna(df['StreetAddress'].str.extract(r'(\d{5})', expand=False))
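
The two `fillna` orderings only differ when a row has a ZIP in both columns. The sketch below uses hypothetical data (the third row and its conflicting values are invented for illustration) to show which value each ordering keeps. Note that neither variant removes the ZIP from StreetAddress.

```python
import pandas as pd

# Hypothetical data: the last row has a ZIP in both places, and they disagree.
df = pd.DataFrame({
    "StreetAddress": ["1 Main St 01123", "1 Main St", "2 Oak Ave 06098"],
    "Zip": [None, "01123", "99999"],
})

found = df["StreetAddress"].str.extract(r"(\d{5})", expand=False)

# Extracted value wins; the existing Zip is only a fallback.
prefer_extracted = found.fillna(df["Zip"])

# Existing Zip wins; the extracted value only fills the gaps.
prefer_existing = df["Zip"].fillna(found)

print(prefer_extracted.tolist())  # ['01123', '01123', '06098']
print(prefer_existing.tolist())   # ['01123', '01123', '99999']
```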

Output:

  StreetAddress     City State    Zip
0     1 Main St  Winsted    CT  01123
1     1 Main St  Winsted    CT  01123