The problem
I want to create a parser which matches strings like
"alpha beta 123 cpart"
 ----^----- -^- --^--
     A:      B:   C:
 alphanums  num alpha
However the B part should only match when it appears in the second half of the string (i.e. to the 'right' of the string's mid-point).
So the above sample string should be parsed into the parts:
A: ['alpha', 'beta']
B: '123'
C: ['cpart']
But the string "123 alpha beta cpart" should be parsed into:
A: '123 alpha beta cpart'
B: ''
C: ''
First approximation with pyparsing
As a starting point with pyparsing I tried to use the matchOnlyAtCol function (thinking I could later provide a modified version which accepts a range instead of a single column). However, I got stuck on some strange behaviour of matchOnlyAtCol. Here is my demo code:
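import pyparsing as pp

# B: a number that is accepted only when it starts at column 12
b_only_near_end = pp.Word(pp.nums)\
    .setParseAction(pp.matchOnlyAtCol(12))('B')
# A: alphanumeric words, supposed to stop as soon as B is ahead
a = pp.ZeroOrMore(pp.Word(pp.alphanums), stopOn=b_only_near_end)('A')
c = pp.ZeroOrMore(pp.Word(pp.alphas))('C')
expr = a + pp.Optional(b_only_near_end) + c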
1) When I feed the first sample string "alpha beta 123 cpart" into expr's parseString I get the expected result
A: ['alpha', 'beta']
B: '123'
C: ['cpart']
because B starts exactly on column 12.
2) When I feed it the second string "123 alpha beta cpart" (part B on column 1) I get
ParseException:
Expected end of text (at char 0), (line:1, col:1)
">!<123 alpha beta cpart"
Why? b_only_near_end should not match at all and therefore should not stop the expression a, so I expect a to eat up all characters. I also don't expect an exception to bubble up, because all parts are optional in some way (either via the Optional class or via the ZeroOrMore construct).
Update: What match debugging reveals
I switched on debugging via setDebug() for the ZeroOrMore elements via the following expression code:
b_word = pp.Word(pp.nums).setName('_B_word_')
b_word.setDebug()
b_only_near_end = b_word\
    .setParseAction(pp.matchOnlyAtCol(12))('B')
a_word = pp.Word(pp.alphanums).setName('_A_word_')
a_word.setDebug()
a = pp.ZeroOrMore(a_word, stopOn=b_only_near_end).setName('__A__')('A')
a.setDebug()
c_word = pp.Word(pp.alphas).setName('_C_word_')
c_word.setDebug()
c = pp.ZeroOrMore(c_word).setName('__C__')('C')
c.setDebug()
expr = a + pp.Optional(b_only_near_end) + c
1) When feeding in the string "alpha beta 123 cpart" I get as debug output:
Match __A__ at loc 0(1,1)
Match _B_word_ at loc 0(1,1)
Exception raised:Expected _B_word_ (at char 0), (line:1, col:1)
Match _A_word_ at loc 0(1,1)
Matched _A_word_ -> ['alpha']
Match _B_word_ at loc 5(1,6)
Exception raised:Expected _B_word_ (at char 6), (line:1, col:7)
Match _A_word_ at loc 5(1,6)
Matched _A_word_ -> ['beta']
Match _B_word_ at loc 10(1,11)
Matched _B_word_ -> ['123']
Matched __A__ -> ['alpha', 'beta']
Match _B_word_ at loc 11(1,12)
Matched _B_word_ -> ['123']
Match __C__ at loc 14(1,15)
Match _C_word_ at loc 15(1,16)
Matched _C_word_ -> ['cpart']
Match _C_word_ at loc 20(1,21)
Exception raised:Expected _C_word_ (at char 20), (line:1, col:21)
Matched __C__ -> ['cpart']
2) With the string "123 alpha beta cpart" the output is:
Match __A__ at loc 0(1,1)
Match _B_word_ at loc 0(1,1)
Matched _B_word_ -> ['123']
Matched __A__ -> []
Match _B_word_ at loc 0(1,1)
Exception raised:matched token not at column 12 (at char 0), (line:1, col:1)
Match __C__ at loc 0(1,1)
Match _C_word_ at loc 0(1,1)
Exception raised:Expected _C_word_ (at char 0), (line:1, col:1)
Matched __C__ -> []
plus the ParseException:
Expected end of text (at char 0), (line:1, col:1)
">!<123 alpha beta cpart"
So this means part A matches at the beginning of the string with an empty match result, because a_word does not match. So I guess I have to make A more greedy, but how?
The strange thing is that
Matched __A__ -> []
occurs before
Match _B_word_ at loc 0(1,1)
Exception raised:matched token not at column 12 (at char 0), (line:1, col:1)
A should "wait" longer before producing its match result, but how can I force it to do so?
Maybe the whole approach is not fruitful? Is there another way to achieve matching only in the second part of a string?
1) In the code of First approximation with pyparsing, look at the line that attaches the parse action to b_only_near_end. When attaching the parse action, set the callDuringTry option. Then matchOnlyAtCol will be checked "during lookaheads and alternate testing" (quoted from the documentation) as well. Without the option this does not happen!

2) In order to address the title question "Match an element only when it appears in the right half of the parsing string" (spelt out under The problem), define a function that accepts a match only in the right half of the string and use it as the parse action.
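
A minimal sketch of both points, assuming the camelCase pyparsing API used in the question (setParseAction with the callDuringTry keyword, parseString, stopOn). The names match_only_in_right_half and build_expr are hypothetical, and reading "right half" as a start index of at least len(s) / 2 is an assumption:

```python
import pyparsing as pp


def match_only_in_right_half(s, loc, toks):
    """Parse action (hypothetical name): reject tokens starting left of the mid-point."""
    # Assumption: "right half" means the token's start index is >= len(s) / 2.
    if loc < len(s) / 2:
        raise pp.ParseException(s, loc, "matched token not in right half of string")


def build_expr(b_action):
    """Build the A + [B] + C grammar from the question, with b_action guarding B."""
    # callDuringTry=True makes the action run during the stopOn lookahead as well,
    # so a B candidate in the wrong place no longer ends part A.
    b = pp.Word(pp.nums).setParseAction(b_action, callDuringTry=True)('B')
    a = pp.ZeroOrMore(pp.Word(pp.alphanums), stopOn=b)('A')
    c = pp.ZeroOrMore(pp.Word(pp.alphas))('C')
    return a + pp.Optional(b) + c


# Point 1: the fixed-column check from the question.
expr_col12 = build_expr(pp.matchOnlyAtCol(12))
# Point 2: the right-half check.
expr_half = build_expr(match_only_in_right_half)

for text in ("alpha beta 123 cpart", "123 alpha beta cpart"):
    for expr in (expr_col12, expr_half):
        print(expr.parseString(text, parseAll=True).dump())
```

With either action, the leading "123" of the second sample string should be rejected as B during the lookahead and end up in part A instead of raising a ParseException.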