Match an element only when it appears in the right half of the parsing string

The problem

I want to create a parser which matches strings like

"alpha beta 123 cpart"

 ----^----- -^- --^--
     A:      B:   C:
 alphanums  num alpha 

However the B part should only match when it appears in the second half of the string (i.e. to the 'right' of the string's mid-point).

So the above sample string should be parsed into the parts:

A: ['alpha', 'beta']
B: '123'
C: ['cpart']

But the string "123 alpha beta cpart" should be parsed into:

A: '123 alpha beta cpart'
B: ''
C: ''
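
To pin the requirement down independently of any parser library, here is a minimal plain-Python sketch. The function name split_abc is hypothetical. The sketch assumes "second half" means that the token's 1-based start column is greater than len(text) // 2. It also simplifies the grammar: words are split on whitespace, B is the first all-digit word past the mid-point, and A comes back as a list of words rather than a single string.

```python
import re


def split_abc(text):
    """Split text into (A, B, C), accepting B only in the right half.

    Assumption: "right half" means the 1-based start column of the
    all-digit word is greater than len(text) // 2.
    """
    middle = len(text) // 2
    # (1-based start column, word) for every whitespace-separated word
    words = [(m.start() + 1, m.group()) for m in re.finditer(r"\S+", text)]
    for i, (col, word) in enumerate(words):
        if word.isdigit() and col > middle:
            before = [w for _, w in words[:i]]
            after = [w for _, w in words[i + 1:]]
            return before, word, after
    # No number in the right half: everything belongs to A
    return [w for _, w in words], "", []


print(split_abc("alpha beta 123 cpart"))  # (['alpha', 'beta'], '123', ['cpart'])
print(split_abc("123 alpha beta cpart"))  # (['123', 'alpha', 'beta', 'cpart'], '', [])
```

This is only a reference for the expected results, not a replacement for the pyparsing grammar.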

First approximation with pyparsing

As a starting point with pyparsing, I tried the matchOnlyAtCol function (thinking I could later provide a modified version that accepts a range instead of a single column). However, I got stuck on some strange behaviour of matchOnlyAtCol. Here is my demo code:

import pyparsing as pp

b_only_near_end = pp.Word(pp.nums)\
                    .setParseAction(pp.matchOnlyAtCol(12))('B')
a = pp.ZeroOrMore(pp.Word(pp.alphanums), stopOn=b_only_near_end)('A')
c = pp.ZeroOrMore(pp.Word(pp.alphas))('C')
expr = a + pp.Optional(b_only_near_end) + c

1) When I feed the first sample string "alpha beta 123 cpart" into expr.parseString() I get the expected result

A: ['alpha', 'beta']
B: '123'
C: ['cpart']

because B starts exactly on column 12.

2) When I feed it the second string "123 alpha beta cpart" (part B on column 1) I get

ParseException:
Expected end of text (at char 0), (line:1, col:1)
">!<123 alpha beta cpart"

Why? b_only_near_end should not match at all and therefore should not stop the expression a. I expect a to consume all the words, and I don't expect an exception to bubble up, because every part is optional (either via the Optional class or via the ZeroOrMore construct).


Update: What match debugging reveals

I switched on debugging via setDebug() for the ZeroOrMore elements via the following expression code:

b_word = pp.Word(pp.nums).setName('_B_word_')
b_word.setDebug()
b_only_near_end = b_word\
                    .setParseAction(pp.matchOnlyAtCol(12))('B')
a_word = pp.Word(pp.alphanums).setName('_A_word_')
a_word.setDebug()
a = pp.ZeroOrMore(a_word, stopOn=b_only_near_end).setName('__A__')('A')
a.setDebug()

c_word = pp.Word(pp.alphas).setName('_C_word_')
c_word.setDebug()
c = pp.ZeroOrMore(c_word).setName('__C__')('C')
c.setDebug()

expr = a + pp.Optional(b_only_near_end) + c

1) When feeding in the string "alpha beta 123 cpart" I get as debug output:

Match __A__ at loc 0(1,1)
Match _B_word_ at loc 0(1,1)
Exception raised:Expected _B_word_ (at char 0), (line:1, col:1)
Match _A_word_ at loc 0(1,1)
Matched _A_word_ -> ['alpha']
Match _B_word_ at loc 5(1,6)
Exception raised:Expected _B_word_ (at char 6), (line:1, col:7)
Match _A_word_ at loc 5(1,6)
Matched _A_word_ -> ['beta']
Match _B_word_ at loc 10(1,11)
Matched _B_word_ -> ['123']
Matched __A__ -> ['alpha', 'beta']
Match _B_word_ at loc 11(1,12)
Matched _B_word_ -> ['123']
Match __C__ at loc 14(1,15)
Match _C_word_ at loc 15(1,16)
Matched _C_word_ -> ['cpart']
Match _C_word_ at loc 20(1,21)
Exception raised:Expected _C_word_ (at char 20), (line:1, col:21)
Matched __C__ -> ['cpart']

2) With the string "123 alpha beta cpart" the output is:

Match __A__ at loc 0(1,1)
Match _B_word_ at loc 0(1,1)
Matched _B_word_ -> ['123']
Matched __A__ -> []
Match _B_word_ at loc 0(1,1)
Exception raised:matched token not at column 12 (at char 0), (line:1, col:1)
Match __C__ at loc 0(1,1)
Match _C_word_ at loc 0(1,1)
Exception raised:Expected _C_word_ (at char 0), (line:1, col:1)
Matched __C__ -> []

plus the ParseException:

Expected end of text (at char 0), (line:1, col:1)
">!<123 alpha beta cpart"

So part A matches at the beginning of the string, but with an empty result, because a_word is never matched. I guess I have to make A greedier, but how?

The strange thing is that

Matched __A__ -> []

occurs before

Match _B_word_ at loc 0(1,1)
Exception raised:matched token not at column 12 (at char 0), (line:1, col:1)

A should "wait" longer before producing its match result, but how can I force it to do so?

Maybe the whole approach is not fruitful? Is there another way to achieve matching only in the second part of a string?

There is 1 answer

Answer by halloleo (best answer)

1) In the code under "First approximation with pyparsing", look at the line

b_only_near_end = pp.Word(pp.nums)\
                  .setParseAction(pp.matchOnlyAtCol(12))('B')

When attaching the parse action, set the callDuringTry option:

b_only_near_end = pp.Word(pp.nums)\
                  .setParseAction(pp.matchOnlyAtCol(12),
                                  callDuringTry=True)('B')

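Then matchOnlyAtCol is also checked "during lookaheads and alternate testing" (quoted from the documentation). Without the option the parse action is skipped in those situations, so the stopOn lookahead in a accepts the bare number at any column. That is why A ended up empty for the second string.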
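
For reference, a minimal sketch of the complete expression with the option set. It assumes pyparsing is installed and provides the camelCase API used in the question. Passing parseAll=True is also an assumption, inferred from the "Expected end of text" error reported above.

```python
import pyparsing as pp

# The column check now also runs when the expression is only "tried",
# e.g. inside the stopOn lookahead of ZeroOrMore.
b_only_near_end = pp.Word(pp.nums).setParseAction(
    pp.matchOnlyAtCol(12), callDuringTry=True
)("B")
a = pp.ZeroOrMore(pp.Word(pp.alphanums), stopOn=b_only_near_end)("A")
c = pp.ZeroOrMore(pp.Word(pp.alphas))("C")
expr = a + pp.Optional(b_only_near_end) + c

first = expr.parseString("alpha beta 123 cpart", parseAll=True)
second = expr.parseString("123 alpha beta cpart", parseAll=True)

# get() returns the default when B did not take part in the match
print(first.asList(), repr(first.get("B", "")))
print(second.asList(), repr(second.get("B", "")))
```

With this change the second string should no longer raise: the leading "123" is consumed as an ordinary word of A and B stays empty.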

2) In order to address the title question "Match an element only when it appears in the right half of the parsing string" (spelt out under The Problem) define a function:

def matchOnlyInRightHalf():
    """
    Helper method for defining parse actions that require matching in the
    right half of the parse string.
    """
    def verifyInRightHalf(strg, locn, toks):
        # pp.col() is 1-based, so the first character is column 1
        col = pp.col(locn, strg)
        middle = len(strg) // 2
        # reject tokens that start at or before the mid-point
        if col <= middle:
            raise pp.ParseException(strg, locn,
                                    "matched token not in right half of string")
    return verifyInRightHalf

and use it as the parse action:

b_only_near_end = pp.Word(pp.nums)\
                  .setParseAction(matchOnlyInRightHalf(),
                                  callDuringTry=True)('B')
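
Putting it together, here is a hedged end-to-end sketch. It assumes pyparsing is installed with the camelCase API and that parseAll=True is wanted. The snake_case name match_only_in_right_half and the two short test strings are illustrative only. Unlike a fixed column, the threshold here moves with the length of the input.

```python
import pyparsing as pp


def match_only_in_right_half():
    """Build a parse action that rejects tokens starting in the left half."""
    def verify(strg, locn, toks):
        # pp.col() is 1-based; the mid-point convention is len(strg) // 2
        if pp.col(locn, strg) <= len(strg) // 2:
            raise pp.ParseException(
                strg, locn, "matched token not in right half of string"
            )
    return verify


b = pp.Word(pp.nums).setParseAction(
    match_only_in_right_half(), callDuringTry=True
)("B")
a = pp.ZeroOrMore(pp.Word(pp.alphanums), stopOn=b)("A")
c = pp.ZeroOrMore(pp.Word(pp.alphas))("C")
expr = a + pp.Optional(b) + c

# "7" starts at column 5 of a 7-character string: right half, so it is B
short = expr.parseString("a b 7 c", parseAll=True)
# "7" starts at column 3 of an 11-character string: left half, so it joins A
longer = expr.parseString("a 7 b c d e", parseAll=True)

print(short.asList(), repr(short.get("B", "")))
print(longer.asList(), repr(longer.get("B", "")))
```

Note that the mid-point is computed from the whole input string, so the sketch only makes sense when each string is parsed on its own.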