How can I capture this computer output with a txr query?

I have data with the following text output from a computer program:

[Coordination geometry type : Single neighbor (IUCr: [1l])

   - coordination number : 1
 ------------------------------------------------------------,
 Coordination geometry type : Linear (IUPAC: L-2 || IUCr: [2l])

   - coordination number : 2
 ------------------------------------------------------------,
 Coordination geometry type : Anticuboctahedron (IUCr: [12aco])

   - coordination number : 12
 ------------------------------------------------------------,
 Coordination geometry type : Square cupola

   - coordination number : 12
 ------------------------------------------------------------,
 Coordination geometry type : Hexagonal prism (IUPAC: HPR-12 || IUCr: [12p])

   - coordination number : 12
 ------------------------------------------------------------,
 Coordination geometry type : Hexagonal antiprism (IUPAC: HAPR-12)

   - coordination number : 12
 ------------------------------------------------------------,
 Coordination geometry type : Square-face capped hexagonal prism

   - coordination number : 13
 ------------------------------------,------------------------,
 Coordination geometry type : Unknown environment

   - coordination number : None
 ------------------------------------------------------------,
 Coordination geometry type : Unclear environment

   - coordination number : None
 ------------------------------------------------------------]

I'd like to write a txr query that captures this data and produces the following CSV table:

cnum,geom,IUPAC,IUCr
1,single neighbor,,[1l]
2,linear,L-2,
2,linear,,[2l]
12,anticuboctahedron,,[12aco]
12,square cupola,,
12,hexagonal prism,HPR-12,
12,hexagonal prism,,[12p]
12,hexagonal antiprism,HAPR-12,
13,square-face capped hexagonal prism,,
,unknown environment,,
,unclear environment,,

I think the query should be something like:

@(collect :vars (cnum geom IUPAC IUCr))
@/ |\[/Coordination geometry type : @geom @(maybe)(IUPAC: @IUPAC@(maybe) || IUCr: @IUCr(@end))@(end)

   - coordination number : @cnum
 ------------------------------------------------------------,   
@(last)
 ------------------------------------------------------------] 
@(end)
@(output)
cnum,geom,IUPAC,IUCr
@(repeat)
@cnum,@(do (downcase-str @geom)),@IUPAC,@IUCr
@(end)
@(end)

But this has several syntax errors, and a few points confuse me. Specifically:

Since this output is presented as a bracketed list, it's not clear to me how to capture the first line. I thought a regular expression might do the trick, but that doesn't seem very efficient, and I wasn't sure whether I had the syntax right. Should the [ be escaped? Could I instead use @(first), which is mentioned with the @(repeat) directive, to handle this case?

IUPAC and IUCr are optional values that appear within parentheses, separated by ||. I think these might be captured by @(maybe), but I'm not sure exactly how that directive works or whether it can be nested. If neither is specified, there are no parentheses at all (as with square cupola).

In the interest of providing more examples to myself and others for learning this powerful tool, I thought I might post this question here.

Answer by Kaz (best answer)

Try this:

@(collect :vars (cnum geom (IUPAC "") (IUCr "")))
@  (cases)
@/[ \[]/Coordination geometry type : @geom (IUPAC: @IUPAC || IUCr: @IUCr)
@  (or)
@/[ \[]/Coordination geometry type : @geom (IUPAC: @IUPAC)
@  (or)
@/[ \[]/Coordination geometry type : @geom (IUCr: @IUCr)
@  (or)
@/[ \[]/Coordination geometry type : @geom
@  (end)

   - coordination number : @cnum
@(last)
 ------------------------------------------------------------]
@(end)
@(output)
cnum,geom,IUPAC,IUCr
@  (repeat)
@cnum,@{geom :filter :downcase},@IUPAC,@IUCr
@  (end)
@(end)

The syntax errors in your original code are caused by the (@end) typo; fix that and txr accepts it. Unfortunately, the corrected query still doesn't produce the desired output.

The first difference is that in the collect :vars, we give the optional variables default values. Without these, failure to bind them in those cases when they do not occur will trigger a failure.

Yes, maybe does nest, and the problem can be solved that way. However, at the cost of a bit of repetition, the line-oriented cases directive handles the alternatives in a way that is easy to read.

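Note that the order of the cases is important: the most specific pattern comes first. cases stops at the first clause that matches, so if the bare @geom clause came first, geom would capture the rest of the line, parenthesized identifiers included.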

We don't match the ...----, separator lines. Doing so would only prevent us from matching the last record, which ends with ...---]. Note that the last clause specifies an alternative pattern for the entire collect body, not just an alternative for the last line. We retain it as a termination test, though it might not be necessary at all: the test is needed only if the data is followed by other material that could produce false positive matches or waste significant time.

In the output, we cannot use do; there is no such directive. (The do is in fact interpreted as Lisp there.) I used an output filter to downcase the geometry. If @(downcase-str geom) is used instead, the repeat needs :vars (geom), because geom then occurs only in a Lisp expression, and repeat, being fairly naive, doesn't analyze Lisp code for variables that are to be auto-iterated.
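For illustration, the downcase-str alternative described in the previous paragraph might look like the sketch below. It has not been run here. The collect is copied unchanged from the query above; only the repeat line and the geom field in the output differ.

```txr
@;; Sketch: same collect as above, but downcasing via a Lisp call in the output.
@(collect :vars (cnum geom (IUPAC "") (IUCr "")))
@  (cases)
@/[ \[]/Coordination geometry type : @geom (IUPAC: @IUPAC || IUCr: @IUCr)
@  (or)
@/[ \[]/Coordination geometry type : @geom (IUPAC: @IUPAC)
@  (or)
@/[ \[]/Coordination geometry type : @geom (IUCr: @IUCr)
@  (or)
@/[ \[]/Coordination geometry type : @geom
@  (end)

   - coordination number : @cnum
@(last)
 ------------------------------------------------------------]
@(end)
@;; geom appears only inside a Lisp expression below, so it is named in :vars
@;; to make repeat iterate over it along with the other variables.
@(output)
cnum,geom,IUPAC,IUCr
@  (repeat :vars (geom))
@cnum,@(downcase-str geom),@IUPAC,@IUCr
@  (end)
@(end)
```

Assuming the query is saved as geom.txr and the program output as data.txt (both file names are hypothetical), it would be run as txr geom.txr data.txt. The :filter :downcase form is shorter and needs no :vars, which is presumably why the answer prefers it.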


Here is a way to reduce the duplication in the cases, using a pattern function to keep things tidy:

@(define formula (x y))@\
  @(cases) (IUPAC: @x || IUCr: @y)@\
  @(or) (IUPAC: @x)@\
  @(or) (IUCr: @y)@\
  @(or)@(eol)@\
  @(end)@\
@(end)
@(collect :vars (cnum geom (IUPAC "") (IUCr "")))
@/[ \[]/Coordination geometry type : @geom@(formula IUPAC IUCr)

   - coordination number : @cnum
@(last)
 ------------------------------------------------------------]
@(end)
@(output)
cnum,geom,IUPAC,IUCr
@  (repeat)
@cnum,@{geom :filter :downcase},@IUPAC,@IUCr
@  (end)
@(end)