SAS: Extract ID's separated by dashes from text string

1.5k views Asked by At

I have a dataset that has one concatenated text field. I am trying to break it into three text columns in SAS 9.4.

Obs Var1
1   MAY12-KANSAS-ABCD6194-XY7199-BRULE
2   JAN32-OHIO-BZ5752-GARY

My output for observation 1 should look like this:

Obs   Date   State   ID
1     MAY12  KANSAS  ABCD6194-XY7199-BRULE

Here's what I have, which works for the Date and State. However, I can't get the third part (ID) to ignore the delimiter:

data have;
   input Var1 &$64.;
   cards;
MAY12-KANSAS-ABCD6194-XY7199-BRULE
JAN32-OHIO-BZ5752-GARY
;
run;
data need;
   length id $16;
   set have;
   date = scan(var1,1,'-','o');
   state = scan(var1,2,'-','o');
   id = scan(var1,3,'-','');
run;
4

There are 4 answers

0
Joe On BEST ANSWER

Another different approach to get a multi-delimiter-containing word is to use call scan. It will tell you the position (and length, which we ignore) of the nth word (and can be forwards OR backwards, so this gives you some ability to search in the middle of the string).

Implemented for this particular case it's very simple:

data want;
  set have;
  length 
    date $5
    state $20
    id $50
  ;
  date = scan(var1,1);
  state= scan(var1,2);
  call scan(var1,3,position,length);
  id = substr(var1,position);
run;

position and length are variables that call scan populates with values which we can then use. (We could have done date and state that way also, but it's more work than using the function.)

0
Joe On

_INFILE_ Magic is the search term for this. Basically, we use the _infile_ automatic variable (which contains the line currently being read in from the input file) to take advantage of the easy syntax for inputting.

*Create a dataset with your data;
data have;
  length var1 $100;
  input var1 $;
datalines;
MAY12-KANSAS-ABCD6194-XY7199-BRULE
JAN32-OHIO-BZ5752-GARY
;;;;
run;

data want;
  infile cards dlm='-';  *cards=datalines=in-line data, just used to get something blank to pretend to read from;
  length
    date $5
    state $20
    id $50
  ;                 *set up lengths for your variables;
  set have;         *Now read in your dataset;
  input @1 @@;      *Start the input pointer at the beginning of the imaginary line and do not advance lines ever (since we have no lines!);
  _infile_=var1;    *Assign the contents of var1 to _infile_, so now we have a line;
  input
    date $
    state $
    id $ & @@
    ;               *Input using normal input techniques.  & tells it to read the rest of that line into the id variable (even the dashes), and @@ says stay on this line still (since we have more rows).
  cards;

run;                *Note we do not really have any cards/datalines.;

Note this only works because your embedded delimiters occur at the end of the line. If they were in the middle of the line, this approach would not work and you would have to use a different approach (such as a regular expression, or telling whomever gave you a file formatted in this way to give it to you in a more useful fashion, ie, not using a delimiter that also is in the data).

9
Haikuo Bian On

Regular Expression maybe an option

The detailed documenation can be found here: http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#n0r8h2fa8djqf1n1cnenrvm573br.htm

I will suggest you to start with this paper: http://www2.sas.com/proceedings/sugi29/265-29.pdf

Direct explantation on the code may even confuse you worse if you have never touched those PRX functions.

data have;
    input Var1 &$64.;
    cards;
MAY12-KANSAS-ABCD6194-XY7199-BRULE
JAN32-OHIO-BZ5752-GARY
;
run;

data want;
    set have;
    date=scan(var1,1);
    state=scan(var1,2);
    id=prxchange('s/^\w+-\w+-//',1,var1);
run;

Or as a simple fix to your existing code:

data want;
    set have;
    date=scan(var1,1);
    state=scan(var1,2);
/*  id=prxchange('s/^\w+-\w+-//',1,var1);*/
    id=substr(var1,length(cats(date,state))+3);
run;

The expression started by length function will give the start point where to extract your string of interest. Length + Cats function will give the total length of date and state, +3 is offset '-'.

0
Shenglin Chen On

Use tranwrd to drop date and state.

data want;
  set have;
  length 
    date $5
    state $20
    id $50
  ;
  date = scan(var1,1);
  state= scan(var1,2);
  id= tranwrd(var1,cats(date,'-',state,'-'),'');
run;