Unidata - Extract Count of strings from MultiValued fields

627 views Asked by At

I'm a bit of a newbie when it comes to Unidata.

My problem is this;

I have lots of records with 20 fields. I want to extract a TOTAL count of the different values used in 1 of those fields; the field in question is multi-valued, so the data looks like "CSR²STR²CD2" etc.

I want to output to excel so the data looks something like the below, containing the COUNT for each value and how many times it appears in that field, across all records.

Column1 Column2
CSR 234235
STR 987346
CD2 736252

I've been told I can achieve this using a dictionary that calculates this or a subroutine. But I have very little technical grounding in Unidata, so where to start is a bit beyond me.

2

There are 2 answers

0
Ian McGowan On BEST ANSWER

It sounds like you need an exploding sort? Multivalued (MV) fields, while they are very convenient in Basic code, can be sort of a pain to deal with in the query language. There is explicit support for them, but it's not as simple as non-MV data. The key is to do an "exploding" sort, which will flatten the MV fields - I think of it as creating "virtual" rows for each one of the MV fields. If there are multiple MV fields and they are associated correctly with each other, they will be linked together in the virtual rows. Single valued and non-associated MV fields will be duplicated in each of the virtual rows. It's difficult to explain, but for your question here's an example query:

The query and results:

>sort IAN.TEMP BY.EXP ATB BREAK.ON ATB TOTAL COUNTER ID.SUP DET.SUP

ATB.. COUNT
CD2       3
CD3       3
CD4       2
CSR       3
IAN       2
      =====
TOTAL    13
13 records listed

Setting up the dictionary:

>AE DICT IAN.TEMP ATB COUNTER
< 1 > Top of "ATB" in "DICT IAN.TEMP", 7 lines, 11 characters.
*--: P
001: D
002: 1
003:
004:
005: 5L
006: M  <--- This is very important!  Must be M (or MV) for BY.EXP to work
007:    <--- If there are associated MV fields, this needs to be populated
Bottom.
*--: EX
Quit "ATB" in file "DICT IAN.TEMP" unchanged.
< 2 > Top of "COUNTER" in "DICT IAN.TEMP", 6 lines, 10 characters.
*--: P
001: I
002: 1  <-- This just returns "1" for every row in the output, to help with totals
003:
004:
005: 5R
006: S
Bottom.
*--: EX
Quit "COUNTER" in file "DICT IAN.TEMP" unchanged.

Setting up the sample data:

>AE IAN.TEMP *

4 record(s) selected.

< 1 > Top of "3" in "IAN.TEMP", 1 line, 7 characters.
*--: P
001: CD2▒IAN
Bottom.
*--: EX
Quit "3" in file "IAN.TEMP" unchanged.
< 2 > Top of "1" in "IAN.TEMP", 1 line, 15 characters.
*--: P
001: CSR▒CD2▒CD3▒IAN
Bottom.
*--: EX
Quit "1" in file "IAN.TEMP" unchanged.
< 3 > Top of "4" in "IAN.TEMP", 1 line, 15 characters.
*--: P
001: CSR▒CD2▒CD3▒CD4
Bottom.
*--: EX
Quit "4" in file "IAN.TEMP" unchanged.
< 4 > Top of "2" in "IAN.TEMP", 1 line, 11 characters.
*--: P
001: CD4▒CD3▒CSR
Bottom.
*--: EX
Quit "2" in file "IAN.TEMP" unchanged.
0
TonyG On

Depending on your needs the solution would be different. For example...

You can use the SREFORMAT command to extract values from specific attributes into a new file, and then easily get counts of items from that file. In your example you'd have an item CSR with attribute 1 containing all of the keys of the items that included CSR. Then you'd just need another dict item to display the value count of atb1. That would result in the exact output that you've described.

You could do this programmatically with code like the following.

* Assume a1 looks like CSR]CD2]CSR]CSR]CD2...
VALS = DCOUNT(REC<1>,@VM)
COUNTS = ""
FOR V = 1 TO VALS
  LOCATE(REC<1,V>,COUNTS,1;POS) THEN
     COUNTS<2,POS> += 1
  END ELSE
     COUNTS<1,-1> = REC<1,V>
     COUNTS<2,-1> = 0
  END
NEXT V
* Result in COUNTS:
* A1 = CSR]CD2
* A2 = 3]2

Does that fit your use case?