TXR: removing trailing and leading commas in data respecting header line

62 views Asked by At

I have a lot of data like the following:

There are many ways data could be missing.,,,,,,,,,
,,,,,,,,,,,
An entire interior column could be missing.,,,,,,,,,

[missing/data/inside],,,,,,,,,
a,b,c,,,,,,,
1,,3,,,,,,,
1,,4,,,,,,,
3,,2,,,,,,,
,,,,,,,,,
An indented data with 2 completely missing columns.,,,,,,,,,

,,,,,,,[missing/data/outside],,
,,,,,,,a,b,c
,,,,,,,,3,
,,,,,,,,4,,,,,,,,
,,,,,,,,2,,,,,,,,

I want to tidy it up a bit into:

There are many ways data could be missing.

An entire interior column could be missing.

[missing/data/inside]
a,b,c
1,,3
1,,4
3,,2

An indented data with 2 completely missing columns.

[missing/data/outside]
a,b,c
,3,
,4,
,2,

The challenges are:

  • keeping all non-table text annotations (cleanup up any leading or trailing commas)
  • keeping the appropriate number of commas in data tables based on their header

If I didn't have the second challenge, I would just pipe my output through sed:

... | output | sed 's/,*$//g' | sed 's/^,*//g'

I trust that the number of commas to the left of the data will be equal in the header and data lines. However, I can't trust the same for the lagging commas.

I've written the following TXR code:

@(define empty_line)@\
@  (cases)@\
@/,*/@(eol)@\
@  (or)@\
@/[ ]*/@(eol)@\
@  (or)@\
@(eol)@\
@  (end)@\
@(end)
@(define heading)@/[a-z]+(:[^,]*)?/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@/.*/@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@  (collect)
@lead@{data (content head)}@(skip)
@  (until)
@(empty_line)
@  (end)
@(end)
@(collect)
@annotation
@(empty_line)
@(table loc head data)
@(end)
@(output)
@  (repeat)
@annotation

[@loc]
@head
@    (repeat)
@data
@    (end)

@  (end)
@(end)

How might I write the content function to extract out the appropriate number of columns from the input data? I thought maybe it might be as easy as using the coll or rep directives like:

@(define content (hdr))@\
@  (coll :gap 0 :times (length (split-str hdr ",")))@{x /[^,]/}@(end)@\
@(end)

This code doesn't reliably capture nor clean up annotations. Since annotation can exist anywhere that is not a table. How can I extract them and clean them up? I tried a few ways using @(maybe) and another nested @(collect) with no luck.

@  (maybe)
@    (collect)
@/,*/@annotation@/,*/
@    (until)
@(empty_line)
@/,*/[@loc]@(skip)
@    (end)
@  (end)

Update:

I tried to solve just the table data collection part independently, for which I wrote the following code:

@(define heading)@/[^,]+/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@\
@  (coll :gap 1 :mintimes 1 :maxtimes (length (split-str hdr ",")))@\
@/[^,]*/@\
@  (end)@\
@(end)
@{lead /,*/}@{head (header)}@(skip)
@(collect :gap 0 :vars (data))
@lead@{data (content head)}@/,*/
@(end)
@(output)
@head
@  (repeat)
@data
@  (end)
@(end)

Here is my sample data:

,,alpha,foxtrot: m,tango: b,,
,,1,a,3,,
,,1,b,,,
,,whisky,c,foxtrot,,
,,,d,,,
,,1,,,,
,,,c,,,,,,

The code gives the correct result in all cases except for the penultimate line. It seems to me the trick to solving this problem is to write a regular expression for coll that correctly extracts blank data. Is there another approach that would make this possible? For example, appending the necessary remaining commas?

2

There are 2 answers

1
wdkrnls On BEST ANSWER

Below is code which seems to work:

@(define empty_line)@\
@  (cases)@\
@/,*/@(eol)@\
@  (or)@\
@/[ ]*/@(eol)@\
@  (or)@\
@(eol)@\
@  (end)@\
@(end)
@(define heading)@/[^,]+/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@\
@/[^,]*/@\
@  (coll :gap 0 :times (- (length (split-str hdr ",")) 1))@\
,@/[^,]*/@\
@  (end)@\
@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@  (collect)
@lead@{data (content head)}@(skip)
@  (until)
@(empty_line)
@  (end)
@(end)
@(collect)
@  (collect)
@/,*/@{annotation /[A-Za-z0-9]+.*[^,]+/}@/,*/
@  (until)
@    (cases)
@(empty_line)
@/,*/[@loc]@(skip)
@    (or)
@(eof)
@    (end)
@  (end)
@(empty_line)
@(table loc head data)
@(end)
@(output)
@  (repeat)
@    (repeat)
@annotation

@    (end)
[@loc]
@head
@    (repeat)
@data
@    (end)

@  (end)
@(end)
0
Kaz On

Just for reference, here is something I hacked up using somewhat different approach. Input is split early into fields, and things proceed from there.

It works on the sample data but doesn't capture it in the right way (following the syntax of annotation lines, empty line, table). Also, it isn't checking whether the data lines in the table have only blank fields before the indented position.

There may be something of use in this anyway.

@(define get-fields (f line))
@  (bind f @(split-str line ","))
@(end)
@(define is-empty (f line))
@  (require (or [all f empty]
                [all line (op eql #\space)]))
@(end)
@(define is-table-start (f loc pos))
@  (next :list f)
@  (skip)
@  (line pos)
[@loc]
@  (rebind pos @(pred pos))
@  (require (and [all [f 0..pos] empty]
                 [all [f (succ pos)..:] empty]))
@(end)
@(define is-headings (f pos))
@  (require (and [all [f 0..pos] empty]
                 (empty [drop-while empty
                                    (drop-while (f^$ #/[a-z]+(:[^,]*)?/)
                                                [f pos..:])])))
@(end)
@(define out-fields (f))
@  (do (put-line `@{f ","}`))
@(end)
@(repeat)
@line
@  (get-fields f line)
@  (cases)
@    (is-empty f line)
@    (do (put-line))
@  (or)
@    (is-table-start f loc pos)
@    hline
@    (get-fields hf hline)
@    (is-headings hf pos)
@    (collect :gap 0)
@      dline
@      (get-fields df dline)
@    (until)
@      (is-empty df dline)
@    (end)
@    (do (put-line `[@loc]`))
@    (bind headings @(take-while [notf empty] (drop pos hf)))
@    (bind endpos @(+ pos (length headings)))
@    (merge tbl hf df)
@    (output)
@      (repeat)
@        {tbl [pos..endpos] ","}
@      (end)
@    (end)
@  (or)
@    (bind trim-f @[take-while [notf empty] [drop-while empty f]])
@    (do (put-line `@{trim-f ","}`))
@  (end)
@(end)