I have a lot of data like the following:
There are many ways data could be missing.,,,,,,,,,
,,,,,,,,,,,
An entire interior column could be missing.,,,,,,,,,
[missing/data/inside],,,,,,,,,
a,b,c,,,,,,,
1,,3,,,,,,,
1,,4,,,,,,,
3,,2,,,,,,,
,,,,,,,,,
An indented data with 2 completely missing columns.,,,,,,,,,
,,,,,,,[missing/data/outside],,
,,,,,,,a,b,c
,,,,,,,,3,
,,,,,,,,4,,,,,,,,
,,,,,,,,2,,,,,,,,
I want to tidy it up a bit into:
There are many ways data could be missing.
An entire interior column could be missing.
[missing/data/inside]
a,b,c
1,,3
1,,4
3,,2
An indented data with 2 completely missing columns.
[missing/data/outside]
a,b,c
,3,
,4,
,2,
The challenges are:
- keeping all non-table text annotations (cleanup up any leading or trailing commas)
- keeping the appropriate number of commas in data tables based on their header
If I didn't have the second challenge, I would just pipe my output through sed
:
... | output | sed 's/,*$//g' | sed 's/^,*//g'
I trust that the number of commas to the left of the data will be equal in the header and data lines. However, I can't trust the same for the lagging commas.
I've written the following TXR
code:
@(define empty_line)@\
@ (cases)@\
@/,*/@(eol)@\
@ (or)@\
@/[ ]*/@(eol)@\
@ (or)@\
@(eol)@\
@ (end)@\
@(end)
@(define heading)@/[a-z]+(:[^,]*)?/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@/.*/@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@ (collect)
@lead@{data (content head)}@(skip)
@ (until)
@(empty_line)
@ (end)
@(end)
@(collect)
@annotation
@(empty_line)
@(table loc head data)
@(end)
@(output)
@ (repeat)
@annotation
[@loc]
@head
@ (repeat)
@data
@ (end)
@ (end)
@(end)
How might I write the content
function to extract out the appropriate number of columns from the input data? I thought maybe it might be as easy as using the coll
or rep
directives like:
@(define content (hdr))@\
@ (coll :gap 0 :times (length (split-str hdr ",")))@{x /[^,]/}@(end)@\
@(end)
This code doesn't reliably capture nor clean up annotations. Since annotation can exist anywhere that is not a table. How can I extract them and clean them up? I tried a few ways using @(maybe)
and another nested @(collect)
with no luck.
@ (maybe)
@ (collect)
@/,*/@annotation@/,*/
@ (until)
@(empty_line)
@/,*/[@loc]@(skip)
@ (end)
@ (end)
Update:
I tried to solve just the table data collection part independently, for which I wrote the following code:
@(define heading)@/[^,]+/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@\
@ (coll :gap 1 :mintimes 1 :maxtimes (length (split-str hdr ",")))@\
@/[^,]*/@\
@ (end)@\
@(end)
@{lead /,*/}@{head (header)}@(skip)
@(collect :gap 0 :vars (data))
@lead@{data (content head)}@/,*/
@(end)
@(output)
@head
@ (repeat)
@data
@ (end)
@(end)
Here is my sample data:
,,alpha,foxtrot: m,tango: b,,
,,1,a,3,,
,,1,b,,,
,,whisky,c,foxtrot,,
,,,d,,,
,,1,,,,
,,,c,,,,,,
The code gives the correct result in all cases except for the penultimate line. It seems to me the trick to solving this problem is to write a regular expression for coll
that correctly extracts blank data. Is there another approach that would make this possible? For example, appending the necessary remaining commas?
Below is code which seems to work: