How to remove strings and punctuation from a columns using SAS or PROC SQL?

101 views Asked by At

I have a column with a lot of company names without any cleaning as below. But I want to do the cleaning to remove a list of strings from company names so I can group them together.

list of strings: com,llc,ltd,corp,stores

| Company |
| --- |
| Amazon |
| Amazon.com |
| Amazon.llc |
| Amazon.ltd |
| Amazon corp |
| Amazon stores |

I also want to remove punctuation and strip spaces from the company names, so the desired outcome can be like:

| Company |
| --- |
| Amazon |
| Amazon |
| Amazon |
| Amazon |
| Amazon |
| Amazon |

I have done my project using Python but I need to convert it to SAS or PROC SQL, but I'm not familiar with these 2 languages, so appreciate any help.

1

There are 1 answers

0
Stu Sztukowski On BEST ANSWER

A regular expression should make this easy:

([\. ])(com|ltd|llc|corp|stores)

https://regex101.com/r/FhnEwN/1

In SAS, finding and replacing strings with regex is done with the prxchange function, which works like this:

string = prxchange('s/regex here/replacement string/', times, string);

Where times is the number of replacements you want to make. -1 means to replace every match. If you leave your replacement string as a null value, like this: 's/regex here//' then you will effectively remove the matched value.

Converting it to SAS:

data want;
    set have;
    company = prxchange('s/([\. ])(com|ltd|llc|corp|stores)//', -1, company);
run;
Company
Amazon
Amazon
Amazon
Amazon
Amazon
Amazon