Non-greedy list parsing with pyparsing


I have a string consisting of a list of words which I am attempting to parse with pyparsing.

The list always has a minimum of three items. From this I want pyparsing to generate three groups: the first should contain all of the words up to the last two items, and the second and third should be the last two items. For example:

"one two three four"

should be parsed to something resembling:

["one two"], "three", "four"

I can do this with a Regex:

import pyparsing as pp
data = "one two three four"
grammar = pp.Regex(r"(?P<first>(\w+\W?)+)\s(?P<penultimate>\w+) (?P<ultimate>\w+)")
print(grammar.parseString(data).dump())

which gives:

['one two three four']
- first: one two
- penultimate: three
- ultimate: four

My problem is that I'm failing to get the same result with the non-Regex ParserElements because of pyparsing's greedy nature. For example, the following:

import pyparsing as pp
data = "one two three four"
word = pp.Word(pp.alphas)
grammar = pp.Group(pp.OneOrMore(word))("first") + word("penultimate") + word("ultimate")
grammar.parseString(data)

fails with the traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/pyparsing.py", line 1125, in parseString
    raise exc
pyparsing.ParseException: Expected W:(abcd...) (at char 18), (line:1, col:19)

because OneOrMore slurps all of the words in the list, leaving nothing for the penultimate and ultimate words to match. My attempts so far to prevent this greedy behaviour with FollowedBy or NotAny are failing - any suggestions as to how I can get the desired behaviour?
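
The greedy matching described here can be seen in isolation with a minimal sketch. It reuses the data and word definitions from the question; the names greedy and consumed are introduced only for illustration.

```python
import pyparsing as pp

data = "one two three four"
word = pp.Word(pp.alphas)

# OneOrMore takes every word it can match and does not backtrack,
# so any expression placed after it has no input left to parse.
greedy = pp.OneOrMore(word)
consumed = greedy.parseString(data).asList()
print(consumed)
```

All four words end up in consumed, which is why the parse reports an error at the end of the string.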

Answer by PaulMcG (best answer):

Well, your OneOrMore expression just needs a little tightening up - you are on the right track with FollowedBy. You don't really want just OneOrMore(word), you want "OneOrMore(word that is followed by at least 2 more words)". To add this kind of lookahead in pyparsing, you can even use the new '*' multiplication operator to specify the lookahead count:

grammar = pp.Group(pp.OneOrMore(word + pp.FollowedBy(word*2)))("first") + word("penultimate") + word("ultimate")

Now dumping this out gives the desired result:

[['one', 'two'], 'three', 'four']
- first: ['one', 'two']
- penultimate: three
- ultimate: four
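
For reference, the grammar from the answer can be put together into a complete script along these lines. This is a sketch assuming the same data and word definitions as in the question; the name leading_word is not from the original and is only there to make the lookahead easier to read.

```python
import pyparsing as pp

data = "one two three four"
word = pp.Word(pp.alphas)

# A "leading" word is one that still has at least two more words after it.
# FollowedBy is a lookahead: it checks the input without consuming it.
leading_word = word + pp.FollowedBy(word * 2)

grammar = (
    pp.Group(pp.OneOrMore(leading_word))("first")
    + word("penultimate")
    + word("ultimate")
)

result = grammar.parseString(data)
print(result.dump())

# The shortest allowed list (three items) leaves a single word in "first".
shortest = grammar.parseString("one two three")
print(shortest.asList())
```

Because the lookahead fails as soon as fewer than two words remain, OneOrMore stops early and the last two words stay available for penultimate and ultimate.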