TXR: Parsing summary reports containing unicode with a more complicated syntax using functions

I'm trying to parse the "summary" region of a batch of computer-generated reports, where the report names and their associated variables change from file to file. A made-up example in that format follows:

 Summary Report


       Bath Tub

  Temperature:    30 °C       

  Water ready                 
       volume:    200000 cm³  


    Bath Room

   Floor Area:    40 ft²      

  Door Height:    9 ± 0.1 ft  



Full Report Set

It's hard to see from the above what the white space looks like, so here is a screenshot of my text editor with visible white space.

[Screenshot: dummy report summary file with whitespace made visible]

The region of interest starts with Summary Report and ends with Full Report Set. Properties can potentially span two lines. The property names are aligned such that the colon : stays at the same character position within each sub-report.

From the diagnostic output, it appears my attempt to exploit this fact is not working.

txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 11 vs. k)
txr: (src/generic-micrometrics-report.txr:36) variable k binding mismatch (13 vs. 12)
txr: (src/generic-micrometrics-report.txr:36) chr mismatch (position 12 vs. k)
txr: (src/generic-micrometrics-report.txr:36) string matched, position 13-18 (data/dummy-generic-report.txt:6)
txr: (src/generic-micrometrics-report.txr:36) Temperature: 30 °C
txr: (src/generic-micrometrics-report.txr:36) ^ ^
txr: (src/generic-micrometrics-report.txr:23) spec ran out of data
txr: (source location n/a) function (capture (nil (k . 13) (report . "Bath Tub"))) failed

I've included the code below. Can you explain why this code does not work? Am I doing what I think I'm doing with the colon_position function? If so, why is it failing? How would you write the capture function? Is this the general approach you would take? Is there a better way? Thanks so much for all your help and advice.

@; This output format always starts with or ends with at least 2 blank spaces.
@; Fully blank spaced lines follow each property value pair line.
@(define blank_spaces)
  @/[ ]+/@(eol)
@(end)
@; All colons align at the same column position within the body of a report.
@; If that doesn't happen, that means there is nothing to capture,
@; which shouldn't happen.
@; This function should bind the appropriate position without updating
@; the line position.
@; Reports end when there is an empty line, so don't look past that.
@(define colon_position (column))
@(trailer)
@(gather :vars (column))
@(skip)@(chr column):@(skip)
@(until)

@(end)
@(end)
@; Capture values for a property. Values are always given on a single line.
@; If there is error information, it will be indicated by a ± character.#\x00B1
@(define capture (value error units))
@(cases)@value@\ ±@\ @error@\ @units@/[ ]+/@(eol)@\
@(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
@(end)
@(end)
 Summary Report

@(collect :vars (report property value error units))

 @report

@(forget k)
@(colon_position k)
@(cases)
 @property@(chr k):    @(capture value error units)@(blank_spaces)
@(ord)
@; Properties can span two lines. I have not seen any that span more.
 @property_head@(chr k)     @(blank_spaces)
 @property_tail@(chr k):    @(capture value error units)@(blank_spaces)
 @(merge property property_head property_tail)
 @(cat property " ")
@(end)
@(blank_spaces)
@(end)


Full Report Set
@(output)
report,property,value,error,units
@(repeat)
@report,@property,@value,@error,@units
@(end)
@(end)

Answer by Kaz (best answer)

After making some changes here and there, I'm now getting this output:

report,property,value,error,units
Bath Tub,Temperature,30,,°C
Bath Tub,Water ready volume,200000,,cm³
Bath Room,Floor Area,40,,ft²
Bath Room,Door Height,9,0.1,ft

Code:

@; This output format always starts with or ends with at least 2 blank spaces.
@; Fully blank spaced lines follow each property value pair line.
@(define blank_spaces)@\
@/[ ]*/@(eol)@\
@(end)
@; All colons align at the same column position within the body of a report.
@; If that doesn't happen, that means there is nothing to capture,
@; which shouldn't happen.
@; This function should bind the appropriate position without updating
@; the line position.
@; Reports end when there is an empty line, so don't look past that.
@(define colon_position (column))
@  (trailer)
@  (gather :vars (column))
@  (skip)@(chr column):@(skip)
@(until)

@(end)
@(end)
@; Capture values for a property. Values are always given on a single line.
@; If there is error information, it will be indicated by a ± character.#\x00B1
@(define capture (value error units))@\
  @(cases)@value@\ ±@\ @error@\ @units @(eol)@\
  @(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
  @(end)@\
@(end)
 Summary Report

@(collect :vars (report property value error units))

 @report

@  (colon_position k)
@  (collect)
@    (cases)
 @property@(chr k):    @(capture value error units)@(blank_spaces)
@    (or)
@; Properties can span two lines. I have not seen any that span more.
 @property_head@(chr k)     @(blank_spaces)
 @property_tail@(chr k):    @(capture value error units)@(blank_spaces)
@      (merge property property_head property_tail)
@      (cat property " ")
@    (end)
@  (until)


@  (end)
@(until)
Full Report Set
@(end)
@(output)
report,property,value,error,units
@  (repeat)
@    (repeat)
@report,@property,@value,@error,@units
@    (end)
@  (end)
@(end)

The trick with the colon actually works (nice application of trailer and chr there). What trips the code up is a handful of small details: @(or) misspelled as @(ord); pattern functions that should be horizontal but lack the proper @\ line continuations; a flaw in @(blank_spaces) that makes it insist on consuming at least one space; spurious whitespace before @(merge); and so on.
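For readers less familiar with TXR, the colon look-ahead can be paraphrased in Python. This is only a sketch of the idea, not of how TXR implements @(trailer) or @(chr); the function name colon_column and the sample lines are hypothetical, modeled on the question's data.

```python
def colon_column(lines, start=0):
    """Return the 0-based column of the first colon found at or after
    lines[start], giving up at the first empty line.

    Nothing is consumed: the caller keeps its own position (start),
    which is the role @(trailer) plays in the TXR version.
    """
    for line in lines[start:]:
        if line == "":               # empty line: end of this sub-report
            return None
        column = line.find(":")
        if column != -1:
            return column
    return None


# Hypothetical sub-report body modeled on the question's sample.
body = [
    "  Temperature:    30 °C       ",
    "                              ",
    "  Water ready                 ",
    "       volume:    200000 cm³  ",
    "",
]

k = colon_column(body)
# With k known, a property line is one whose character at column k is a colon.
aligned = [line for line in body if len(line) > k and line[k] == ":"]
```

For this sample k comes out as 13, which agrees with the (k . 13) binding shown in the question's diagnostic output.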

The main problem, though, is that the data is doubly nested, so we need a collect within a collect. We also need proper @(until) termination patterns. For the inner collect, I chose two blank lines; that seems to be what terminates the sections (it works for the data sample). The outer collect is terminated on the Full Report Set line, but that is not strictly necessary.

To go with the nested collection, we use a nested repeat in the output.
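The shape of the nesting can be sketched in Python: an outer collection of sub-reports, each holding an inner collection of properties, flattened again by a nested loop on output. This is a loose paraphrase under assumptions, not a translation of the TXR code: it ignores column positions entirely, treats a colon-less line followed by a blank line as a report title, and the names parse_summary, to_csv, CAPTURE and SAMPLE are made up for illustration.

```python
import re

# value, optional "± error", then units (the two cases of the capture function)
CAPTURE = re.compile(r"(\S+)(?: ± (\S+))? (\S+)")


def parse_summary(text):
    """Return [(report, [(property, value, error, units), ...]), ...]."""
    lines = [ln.strip() for ln in text.splitlines()]
    start = lines.index("Summary Report") + 1
    stop = lines.index("Full Report Set")
    reports = []      # outer collection: one entry per sub-report
    pending = None    # colon-less line that has not been classified yet
    for line in lines[start:stop]:
        if not line:
            if pending is not None:      # assumption: a title precedes a blank line
                reports.append((pending, []))
                pending = None
        elif ":" in line:
            name, _, rest = line.partition(":")
            if pending is not None:      # two-line property: head + tail
                name = pending + " " + name
                pending = None
            value, error, units = CAPTURE.fullmatch(rest.strip()).groups()
            # inner collection: properties of the current sub-report
            reports[-1][1].append((name.strip(), value, error or "", units))
        else:
            pending = line
    return reports


def to_csv(reports):
    out = ["report,property,value,error,units"]
    for report, props in reports:                    # outer repeat
        for prop, value, error, units in props:      # inner repeat
            out.append(f"{report},{prop},{value},{error},{units}")
    return "\n".join(out)


SAMPLE = "\n".join([
    " Summary Report", "", "",
    "       Bath Tub", "",
    "  Temperature:    30 °C       ", "",
    "  Water ready                 ",
    "       volume:    200000 cm³  ", "", "",
    "    Bath Room", "",
    "   Floor Area:    40 ft²      ", "",
    "  Door Height:    9 ± 0.1 ft  ", "", "", "",
    "Full Report Set",
])

print(to_csv(parse_summary(SAMPLE)))
```

On this sample the sketch prints the same five CSV lines as the TXR script above. It makes no attempt at error handling: a property line that appears before any title, or a value that fits neither capture case, raises an exception.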

I applied some indentation. Horizontal functions can use whitespace indentation because leading whitespace after line continuations is ignored.

The @(forget k) is gone; there is no k in the scope there. Each iteration of the surrounding collect will freshly bind k in an environment that is devoid of k.


Addendum: here is a diff against the code that makes it more robust against unexpected data. As it stands, the inner @(collect) silently skips over nonmatching elements, so anything in the file that does not conform to the expected cases is ignored. This behavior is already being taken advantage of: it is why the blank lines between the data items are ignored. We can tighten that with :gap 0 (collected regions must be consecutive) and by handling the blank lines as a case of their own. A fallback case can then diagnose an input line as unrecognized:

diff --git a/extract.txr b/extract.txr
index 8c93d89..3d1fac6 100644
--- a/extract.txr
+++ b/extract.txr
@@ -24,6 +24,7 @@
   @(or)@value@\ @units@/[ ]+/@(eol)@(bind error "")@\
   @(end)@\
 @(end)
+@(name file)
  Summary Report

 @(collect :vars (report property value error units))
@@ -31,7 +32,7 @@
  @report

 @  (colon_position k)
-@  (collect)
+@  (collect :gap 0)
 @    (cases)
  @property@(chr k):    @(capture value error units)@(blank_spaces)
 @    (or)
@@ -40,6 +41,12 @@
  @property_tail@(chr k):    @(capture value error units)@(blank_spaces)
 @      (merge property property_head property_tail)
 @      (cat property " ")
+@    (or)
+
+@    (or)
+@      (line ln)
+@      badline
+@      (throw error `@file:@ln unrecognized syntax: @badline`)
 @    (end)
 @  (until)
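
The effect of the stricter matching can be illustrated outside TXR as well. The Python sketch below assumes the same three recognized cases (single-line property, two-line property, blank line) and rejects anything else with a file name and line number, in the spirit of the @(throw) fallback. The names PROPERTY and check_consecutive and the error type are hypothetical, and the line numbering is relative to the list passed in.

```python
import re

# indented name, colon, value, optional "± error", units, trailing spaces
PROPERTY = re.compile(r"^ +(\S.*?): +(\S+)(?: ± (\S+))? (\S+) *$")


def check_consecutive(lines, filename):
    """Classify every line; raise on the first one that fits no case."""
    kinds = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if PROPERTY.match(line):
            kinds.append("property")
            i += 1
        elif (":" not in line and line.strip()
              and i + 1 < len(lines) and PROPERTY.match(lines[i + 1])):
            kinds.append("two-line property")   # head line plus tail line
            i += 2
        elif line.strip() == "":
            kinds.append("blank")
            i += 1
        else:
            # fallback case: nothing is skipped silently
            raise ValueError(f"{filename}:{i + 1} unrecognized syntax: {line}")
    return kinds
```

Because every line must fall into some case, a malformed line such as "  Temperature = 30" is reported instead of being quietly dropped.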