text matching - unstructured data to structured data - in SAS or R


I need to know how to map unstructured data to structured data.

I have a variable containing customers' addresses, which include their cities. A city name, for example DELHI, can appear as "DELHI", "DEHLI", "DILLI", or "DELI". I need to detect the city name in these addresses and map it to the correct name, "DELHI".

I am trying to implement a solution in SAS or R.

There are 3 answers

Yick Leung On

This might not be the easiest way in SAS, but if the city name sits inside the address string, one option is the TRANWRD function, which replaces every occurrence of a substring within a character value. The syntax is:

tranwrd(variable, original_str, new_str);

For example using your city DELHI:

/* Sample addresses with misspelled city names */
data city;
    input address $1-30;
    datalines;
    1 Ocean drive, DEHLI
    2 Peak road, DELI
    45 Buck street DILLI
    ;
run;

/* Replace each known misspelling with the correct name */
data change;
    set city;
    address = tranwrd(address,' DEHLI ',' DELHI ');
    address = tranwrd(address,' DELI ',' DELHI ');
    address = tranwrd(address,' DILLI ',' DELHI ');
run;

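I put a space before and after both the original and the new strings so that the replacement does not fire on a match inside a longer word (without the spaces, DELICIOUS Road would be changed to DELHICIOUS Road). The trailing space matches at the end of the address only because the variable is padded with blanks to its full length, so an address that fills all 30 characters would not be changed.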
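As an alternative to padding the strings with spaces, a regular expression with word boundaries can do the same job in one call. This is only a sketch, assuming the `city` dataset from above and that the list of misspellings is known in advance; the output dataset name `change_prx` is made up for illustration.

```sas
data change_prx;
    set city;
    /* \b is a word boundary, so DELICIOUS is left alone.         */
    /* -1 replaces every occurrence; the i flag ignores case.     */
    address = prxchange('s/\b(DEHLI|DELI|DILLI)\b/DELHI/i', -1, address);
run;
```

Because the boundary also matches next to a comma or at the end of the value, this does not depend on the variable being padded with trailing blanks.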

user667489 On

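If you want to try to automate the process of matching your numerous incorrect values to correct values, you could put together something based on a string distance such as Hamming distance or Levenshtein distance, perhaps via the COMPLEV function (Levenshtein distance) or the COMPGED function (generalized edit distance). Note that Hamming distance is only defined for strings of equal length, so an edit distance is the better fit for pairs such as DELI and DELHI. You can calculate a score for each manually entered value against each possible structured value, then keep the candidate with the lowest score as your best guess. This will probably not be 100% accurate, but it ought to do a fairly good job far faster than a human could.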
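A minimal sketch of that scoring idea follows. The datasets `cities` and `raw`, their variables, and the sample values are all assumptions made up for illustration; in practice the reference list would come from wherever your valid city names live.

```sas
/* Hypothetical reference list of correctly spelled city names */
data cities;
    length city_std $20;
    input city_std $;
    datalines;
DELHI
MUMBAI
PUNE
;
run;

/* Hypothetical raw values to be standardized */
data raw;
    length city_raw $20;
    input city_raw $;
    datalines;
DEHLI
DILLI
DELI
MUMBAY
;
run;

/* Score every raw value against every reference value */
proc sql;
    create table scored as
    select r.city_raw,
           c.city_std,
           compged(strip(upcase(r.city_raw)), strip(c.city_std)) as score
    from raw as r, cities as c
    order by r.city_raw, score;
quit;

/* Keep the lowest-scoring candidate for each raw value */
data best_guess;
    set scored;
    by city_raw;
    if first.city_raw;
run;
```

The cross join grows as the number of raw values times the number of reference values, so it is worth reducing the raw values to distinct ones first. Ties are broken arbitrarily here, and a raw value with no sensible match still gets a best guess, so a maximum acceptable score is probably needed before trusting the result.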

Joe On

I doubt it is practical to code this in a completely automated fashion, but I would suggest a two-step approach.

First, identify possible matches. There are a number of potential solutions, and the full problem is far more complex than a StackOverflow answer can cover, but you have some suggestions already, and you can look at papers on the internet, such as this paper which explains many of the SAS functions and call routines (COMPGED, SPEDIS, COMPLEV, COMPCOST, SOUNDEX, COMPARE).

Use this approach with a fairly broad stroke, i.e., prefer false positives to false negatives. Simply focus on identifying words one to one, and build a dataset of (original, translation) pairs, such as

Delli, Delhi
Deli, Delhi
Dalhi, Delhi

etc.

Then visually inspect the file and make corrections as needed (i.e., remove false positives).
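One way to produce such a candidate file for review is sketched here. The dataset names `raw_cities` and `ref_cities`, the sample values, and the cutoff of 2 edits are assumptions chosen for illustration, not recommendations.

```sas
/* Hypothetical inputs: distinct raw city values and a reference list */
data raw_cities;
    length original $20;
    input original $;
    datalines;
DELLI
DELI
DALHI
PUNE
;
run;

data ref_cities;
    length translation $20;
    input translation $;
    datalines;
DELHI
PUNE
;
run;

/* Keep every pair within a generous cutoff, so that the review step   */
/* removes false positives instead of hunting for missed matches.      */
/* A distance of 0 is an exact match and needs no translation.         */
proc sql;
    create table translations as
    select r.original,
           c.translation,
           complev(strip(r.original), strip(c.translation)) as lev
    from raw_cities as r, ref_cities as c
    where calculated lev >= 1 and calculated lev <= 2
    order by r.original, lev;
quit;
```

An original that lands within the cutoff of more than one reference city shows up on several rows, which is the kind of case the visual inspection is meant to resolve before the file is used.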

Once you have this dataset, you have a few options for using the results. If you already have the city name as a separate field, or if you can easily isolate it (for example with SCAN) and put it in a separate field, you can use a format solution.

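/* Build a control dataset for PROC FORMAT from the reviewed translations */
data for_fmt;
    set translations;
    start=original;
    label=translation;
    fmtname='$CITYF';
    *no hlo=o record as we want to preserve nonmatches as is;
run;

proc format cntlin=for_fmt;
run;

/* The lookup is case-sensitive, so START values must match CITY exactly. */
/* An explicit width (30 here is an assumption) keeps unmatched values    */
/* from being truncated to the length of the longest label.               */
data want;
    set have;
    city_fixed=put(city,$CITYF30.);
run;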

If you cannot easily identify the city in the address (i.e., your address field is something like "10532 NELSON DRIVE DELHI" with no commas or other delimiters), then the TRANWRD solution is probably best. You can code a hash-based or array-based solution to implement it (rather than a lot of IF statements); if your data does have this problem, post a comment and I'll add to the solution later.
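A hash-based version might look roughly like the following. This is a sketch under assumptions, not a tested production solution: the `translations` and `have` datasets and their contents are invented for illustration, words are assumed to be separated by spaces only, and the lookup keys are assumed to be stored in upper case.

```sas
/* Hypothetical lookup table: one row per known misspelling */
data translations;
    length original translation $20;
    infile datalines dlm=',';
    input original $ translation $;
    datalines;
DEHLI,DELHI
DELI,DELHI
DILLI,DELHI
;
run;

/* Hypothetical addresses with no delimiter around the city */
data have;
    length address $50;
    infile datalines truncover;
    input address $50.;
    datalines;
10532 NELSON DRIVE DELI
45 BUCK STREET DILLI
7 DELICIOUS ROAD DELHI
;
run;

data want;
    length address_fixed $60 word $50;
    /* Define ORIGINAL and TRANSLATION in the PDV without reading rows */
    if 0 then set translations;
    if _n_ = 1 then do;
        declare hash h(dataset: 'translations');
        h.defineKey('original');
        h.defineData('translation');
        h.defineDone();
    end;
    set have;
    address_fixed = '';
    /* Walk the address word by word, swapping any word found in the lookup */
    do i = 1 to countw(address, ' ');
        word = scan(address, i, ' ');
        original = upcase(word);
        if h.find() = 0 then word = translation;
        address_fixed = catx(' ', address_fixed, word);
    end;
    drop i word original translation;
run;
```

Because whole words are compared, DELICIOUS is not touched, and adding a new misspelling only means adding a row to the lookup table. Runs of multiple spaces are collapsed to one by CATX, and a word with attached punctuation such as "DEHLI," would not match unless the delimiter list passed to COUNTW and SCAN is widened.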