I'm trying to parse a "summary" region of a bunch of computer reports, where the report names and their associated variables change from file to file. Below is a made-up example that follows the format:
Summary Report
Bath Tub
Temperature: 30 °C
Water ready
volume: 200000 cm³
Bath Room
Floor Area: 40 ft²
Door Height: 9 ± 0.1 ft
Full Report Set
It's hard to see from the above what the white space looks like, so here is a screenshot of my text editor with visible white space.
The region of interest starts with "Summary Report" and ends with "Full Report Set". Properties can potentially span two lines. The property names are aligned such that the colon (:) stays at the same character position within each sub-report.
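The alignment property can be checked mechanically. Here is a minimal sketch in Python (not TXR); the function name `colon_column` is hypothetical and the leading spaces in the sample are invented, since the real spacing is not reproduced in the text:

```python
def colon_column(lines):
    """Return the column shared by the first colon of every line that has one.

    Lines without a colon (such as the first half of a two-line property)
    are ignored. Returns None if the colons disagree or no line has one.
    """
    columns = {line.index(":") for line in lines if ":" in line}
    return columns.pop() if len(columns) == 1 else None


# Hypothetical sub-report body; the indentation is an assumption chosen so
# that the colons line up, as the format description says they should.
bath_tub = [
    "  Temperature: 30 °C",
    "  Water ready",
    "       volume: 200000 cm³",
]

column = colon_column(bath_tub)
```

The same idea is what the TXR code attempts: find the colon column once per sub-report, then require every property line to have its colon at that column.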
From the diagnostic output, it appears my attempt to exploit this fact is not working.
txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 11 vs. k)
txr: (src/generic-micrometrics-report.txr:36) variable k binding mismatch (13 vs. 12)
txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 12 vs. k)
txr: (src/generic-micrometrics-report.txr:36) string matched, position 13-18 (data/dummy-generic-report.txt:6)
txr: (src/generic-micrometrics-report.txr:36) Temperature: 30 °C
txr: (src/generic-micrometrics-report.txr:36) ^ ^
txr: (src/generic-micrometrics-report.txr:23) spec ran out of data
txr: (source location n/a) function (capture (nil (k . 13) (report . "Bath Tub"))) failed
I've included the code below. Can you explain why this code does not work? Am I doing what I think I'm doing with the colon_position function? If so, why is it failing? How would you write the capture function? Is this the general approach you would take? Is there a better way? Thanks so much for all your help and advice.
@; This output format always starts with or ends with at least 2 blank spaces.
@; Fully blank spaced lines follow each property value pair line.
@(define blank_spaces)
@/[ ]+/@(eol)
@(end)
@; All colons align at the same column position within the body of a report.
@; If that doesn't happen, that means there is nothing to capture,
@; which shouldn't happen.
@; This function should bind the appropriate position without updating
@; the line position.
@; Reports end when there is an empty line, so don't look past that.
@(define colon_position (column))
@(trailer)
@(gather :vars (column))
@(skip)@(chr column):@(skip)
@(until)
@(end)
@(end)
@; Capture values for a property. Values are always given on a single line.
@; If there is error information, it will be indicated by a ± character.#\x00B1
@(define capture (value error units))
@(cases)@value@\ ±@\ @error@\ @units@/[ ]+/@(eol)@\
@(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
@(end)
@(end)
Summary Report
@(collect :vars (report property value error units))
@report
@(forget k)
@(colon_position k)
@(cases)
@property@(chr k): @(capture value error units)@(blank_spaces)
@(ord)
@; Properties can span two lines. I have not seen any that span more.
@property_head@(chr k) @(blank_spaces)
@property_tail@(chr k): @(capture value error units)@(blank_spaces)
@(merge property property_head property_tail)
@(cat property " ")
@(end)
@(blank_spaces)
@(end)
Full Report Set
@(output)
report,property,value,error,units
@(repeat)
@report,@property,@value,@error,@units
@(end)
@(end)
After making some changes here and there, I'm now getting this output:
Code:
The trick with the colon actually works (nice application of trailer and chr there). Where the code is tripped up is various small details: misspelling @(or) as @(orf), pattern functions that should be horizontal not using the proper @\ line continuations, an error in @(blank_spaces) that makes it want to consume some spaces unconditionally, spurious whitespace before @(merge), and such.
Also, the main problem is that the data is doubly nested, so we need a collect within a collect. We also need proper @(until) termination patterns. For the inner collect, I chose two blank lines; that seems to be what terminates the sections (it works for the data sample). The outer collect is terminated on Full Report Set, but that is not strictly necessary.
To go with the nested collection, we use a nested repeat in the output.
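As a language-neutral illustration of the shape involved, here is a small Python sketch (not the TXR solution). The `reports` structure and the `to_csv` helper are hypothetical; the values are taken from the sample data in the question, with the two-line property joined by a space. The outer loop plays the role of the outer repeat and the inner loop the nested one:

```python
import csv
import io

# Hypothetical result of the nested collection: a list of reports, each
# holding its own list of (property, value, error, units) tuples.
reports = [
    ("Bath Tub", [
        ("Temperature", "30", "", "°C"),
        ("Water ready volume", "200000", "", "cm³"),
    ]),
    ("Bath Room", [
        ("Floor Area", "40", "", "ft²"),
        ("Door Height", "9", "0.1", "ft"),
    ]),
]


def to_csv(reports):
    """Flatten the two-level structure into one CSV row per property."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["report", "property", "value", "error", "units"])
    for report, properties in reports:                # outer level: reports
        for prop, value, error, units in properties:  # inner level: properties
            writer.writerow([report, prop, value, error, units])
    return out.getvalue()


csv_text = to_csv(reports)
```

The report name is repeated on every row because it belongs to the outer level, while the other four columns vary at the inner level.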
I applied some indentation. Horizontal functions can use whitespace indentation because leading whitespace after line continuations is ignored.
The @(forget k) is gone; there is no k in the scope there. Each iteration of the surrounding collect will freshly bind k in an environment that is devoid of k.
Addendum: here is a diff against the code for making it more robust against unexpected data. As it is, the inner @(collect) will silently skip over nonmatching elements, which means that if the file contains elements that do not conform to the expected cases, they will be ignored. This behavior is already being taken advantage of: it is why the blank lines between the data items are ignored. We can tighten that with a :gap 0 (collected regions must be consecutive) and handling the blank lines as a case. A fallback case can then diagnose an input line as unrecognized: