An example of my data:

| StreetAddress | City | State | Zip |
|---|---|---|---|
| 1 Main St 01123 | Winsted | CT | |
| 1 Main St | Winsted | CT | 01123 |

I am using regex and pandas to clean a spreadsheet. The problem is that my code overwrites every cell in the `Zip` column, even where the cell already holds valid data.

I tried:

    df['Zip'] = df['StreetAddress'].str.extract(r'(\d{5})')
    df['StreetAddress'] = df['StreetAddress'].str.replace(r'(\d{5})', '', regex=True)

which gives me:

| StreetAddress | City | State | Zip |
|---|---|---|---|
| 1 Main St | Winsted | CT | 01123 |
| 1 Main St | Winsted | CT | |

I was hoping for something more like this:

| StreetAddress | City | State | Zip |
|---|---|---|---|
| 1 Main St | Winsted | CT | 01123 |
| 1 Main St | Winsted | CT | 01123 |

I would use a boolean mask. This avoids overwriting existing data and is also more efficient, since only the relevant rows are evaluated:
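
A minimal sketch of the masking idea, assuming missing `Zip` values are `NaN`/`None` (not empty strings) and that the ZIP is the last five digits of the address. The sample frame is rebuilt from the question's data, and the end-of-string anchor is my assumption to avoid matching a five-digit house number:

```python
import pandas as pd

# Sample data from the question; Zip is stored as text to keep the leading zero
df = pd.DataFrame({
    "StreetAddress": ["1 Main St 01123", "1 Main St"],
    "City": ["Winsted", "Winsted"],
    "State": ["CT", "CT"],
    "Zip": [None, "01123"],
})

# True only for rows where Zip is missing
m = df["Zip"].isna()

# Fill Zip from the address, touching only the masked rows
df.loc[m, "Zip"] = df.loc[m, "StreetAddress"].str.extract(r"(\d{5})\s*$", expand=False)

# Remove the trailing ZIP (and the whitespace before it) on the same rows
df.loc[m, "StreetAddress"] = df.loc[m, "StreetAddress"].str.replace(r"\s*\d{5}\s*$", "", regex=True)
```

If the empty cells are empty strings rather than `NaN`, the mask would need to be something like `df["Zip"].isna() | df["Zip"].eq("")` instead.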
Alternatively:
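
One possible variant, sketched under the same assumptions (missing `Zip` is `NaN`/`None`, ZIP at the end of the address): a single `str.extract` with two capture groups splits the address and the ZIP in one pass, and only rows that actually matched are written back. The group names and the `parts` variable are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "StreetAddress": ["1 Main St 01123", "1 Main St"],
    "City": ["Winsted", "Winsted"],
    "State": ["CT", "CT"],
    "Zip": [None, "01123"],
})

# Rows where Zip is missing
m = df["Zip"].isna()

# Split "street + trailing ZIP" into two columns; rows without a match become NaN
parts = df.loc[m, "StreetAddress"].str.extract(
    r"^(?P<StreetAddress>.*?)\s*(?P<Zip>\d{5})$"
).dropna()

# Write both columns back, only for the rows that matched
df.loc[parts.index, ["StreetAddress", "Zip"]] = parts.to_numpy()
```

Dropping the non-matching rows before assigning keeps an address without a ZIP from being overwritten with `NaN`.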
Or, as suggested by @ouroboros1, to keep the original Zip:
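
A sketch of how this could look, assuming the idea is to extract over the whole column and let `fillna` keep any `Zip` that is already present. The variable name `zip_from_addr` and the end-of-string anchor are my assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "StreetAddress": ["1 Main St 01123", "1 Main St"],
    "City": ["Winsted", "Winsted"],
    "State": ["CT", "CT"],
    "Zip": [None, "01123"],
})

# Candidate ZIPs taken from the end of every address (NaN where there is none)
zip_from_addr = df["StreetAddress"].str.extract(r"(\d{5})\s*$", expand=False)

# Existing Zip values win; only the gaps are filled from the address
df["Zip"] = df["Zip"].fillna(zip_from_addr)

# Strip the trailing ZIP from every address
df["StreetAddress"] = df["StreetAddress"].str.replace(r"\s*\d{5}\s*$", "", regex=True)

print(df)
```

Unlike the masked versions, this removes a trailing ZIP from every address, including rows whose `Zip` was already filled.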
Output:

| StreetAddress | City | State | Zip |
|---|---|---|---|
| 1 Main St | Winsted | CT | 01123 |
| 1 Main St | Winsted | CT | 01123 |