I have data with the following text output from a computer program:
[Coordination geometry type : Single neighbor (IUCr: [1l])
- coordination number : 1
------------------------------------------------------------,
Coordination geometry type : Linear (IUPAC: L-2 || IUCr: [2l])
- coordination number : 2
------------------------------------------------------------,
Coordination geometry type : Anticuboctahedron (IUCr: [12aco])
- coordination number : 12
------------------------------------------------------------,
Coordination geometry type : Square cupola
- coordination number : 12
------------------------------------------------------------,
Coordination geometry type : Hexagonal prism (IUPAC: HPR-12 || IUCr: [12p])
- coordination number : 12
------------------------------------------------------------,
Coordination geometry type : Hexagonal antiprism (IUPAC: HAPR-12)
- coordination number : 12
------------------------------------------------------------,
Coordination geometry type : Square-face capped hexagonal prism
- coordination number : 13
------------------------------------------------------------,
Coordination geometry type : Unknown environment
- coordination number : None
------------------------------------------------------------,
Coordination geometry type : Unclear environment
- coordination number : None
------------------------------------------------------------]
I'd like to write a txr query that captures this data and produces the following CSV table:
cnum,geom,IUPAC,IUCr
1,single neighbor,,[1l]
2,linear,L-2,
2,linear,,[2l]
12,anticuboctahedron,,[12aco]
12,square cupola,,
12,hexagonal prism,HPR-12,
12,hexagonal prism,,[12p]
12,hexagonal antiprism,HAPR-12,
13,square-face capped hexagonal prism,,
,unknown environment,,
,unclear environment,,
I think the query should be something like:
@(collect :vars (cnum geom IUPAC IUCr))
@/ |\[/Coordination geometry type : @geom @(maybe)(IUPAC: @IUPAC@(maybe) || IUCr: @IUCr(@end))@(end)
- coordination number : @cnum
------------------------------------------------------------,
@(last)
------------------------------------------------------------]
@(end)
@(output)
cnum,geom,IUPAC,IUCr
@(repeat)
@cnum,@(do (downcase-str @geom)),@IUPAC,@IUCr
@(end)
@(end)
But there are several syntax errors, and a few points confuse me. Specifically:
Since this output is presented as a bracketed list, it's not clear to me how to capture the first line. I thought a regular expression might do the trick, but that doesn't seem very efficient, and I'm not sure I have the syntax right. Should the [ be escaped? Can I use a @(first) clause, as is mentioned for the @(repeat) directive, to handle this case instead?
IUPAC and IUCr are optional variables that appear within parentheses, separated by ||. I think these might be captured by @(maybe), but I'm not sure exactly how that directive works or whether it can be nested. If neither is specified, there are no parentheses at all (as with square cupola).
In the interest of providing more examples to myself and others for learning this powerful tool, I thought I might post this question here.
Try this:
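The listing below is a sketch consistent with the explanation that follows, not a verified transcript. It keeps the question's variable names, assumes the built-in :downcase output filter for the geometry column, and adds a small nested cases (an assumption of this sketch) so that a coordination number of None comes out as an empty field. It emits one row per record, so Linear and Hexagonal prism each produce a single row carrying both codes rather than the two rows shown in the question.

```txr
@; Sketch only: the None handling and the one-row-per-record output are assumptions.
@(collect :vars (cnum geom (IUPAC "") (IUCr "")))
@(cases)
@/\[?/Coordination geometry type : @geom (IUPAC: @IUPAC || IUCr: @IUCr)
@(or)
@/\[?/Coordination geometry type : @geom (IUPAC: @IUPAC)
@(or)
@/\[?/Coordination geometry type : @geom (IUCr: @IUCr)
@(or)
@/\[?/Coordination geometry type : @geom
@(end)
@(cases)
- coordination number : None
@(bind cnum "")
@(or)
- coordination number : @cnum
@(end)
@(last)
------------------------------------------------------------]
@(end)
@(output)
cnum,geom,IUPAC,IUCr
@(repeat)
@cnum,@{geom :filter :downcase},@IUPAC,@IUCr
@(end)
@(end)
```

The regex /\[?/ matches the optional opening bracket, so the first record needs no special treatment.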
The syntax errors in your original code are caused by the (@end) typo; fix that and txr accepts it. Unfortunately, it doesn't work.
The first difference is that in the collect :vars, we give the optional variables default values. Without these, failure to bind them in those cases when they do not occur will trigger a failure.
Yes, maybe does nest and we can solve it that way. However, at the cost of a bit of repetition, the line-oriented cases handles the cases in a way that is easily readable.
Note how the order of the cases is important.
We don't match the ...----, separator lines. Doing so only prevents us from matching the last record, which ends with ...---]. Note that the last clause specifies an alternative pattern for the entire collect body, not just an alternative for the last line. We retain it to provide the termination test. It might not be necessary at all. We need this test if the data is followed by other material that could add false positive matches, or waste significant time.
In the output, we cannot use do; there is no such directive. (The do is in fact interpreted as Lisp there). I used an output filter to downcase the geometry. If @(downcase-str geom) is used, we then need :vars (geom) in the repeat, because geom then occurs only in a Lisp expression, and repeat, being fairly naive, doesn't analyze Lisp code for variables that are to be auto-iterated.
Here is a way to reduce the duplication in the cases, using a pattern function to keep things tidy:
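One possible shape for this, as a sketch rather than a definitive version: a horizontal pattern function, given the hypothetical name cgt here, matches the optional [ and the fixed prefix, so each case spells out only what differs. The output section uses the @(downcase-str geom) form instead of the filter, to show the :vars (geom) requirement mentioned above. In this variant a coordination number of None is passed through unchanged.

```txr
@; Sketch only: cgt is a hypothetical name for the shared line prefix.
@(define cgt ())@/\[?/Coordination geometry type :@(end)
@(collect :vars (cnum geom (IUPAC "") (IUCr "")))
@(cases)
@(cgt) @geom (IUPAC: @IUPAC || IUCr: @IUCr)
@(or)
@(cgt) @geom (IUPAC: @IUPAC)
@(or)
@(cgt) @geom (IUCr: @IUCr)
@(or)
@(cgt) @geom
@(end)
- coordination number : @cnum
@(last)
------------------------------------------------------------]
@(end)
@(output)
cnum,geom,IUPAC,IUCr
@(repeat :vars (geom))
@cnum,@(downcase-str geom),@IUPAC,@IUCr
@(end)
@(end)
```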