Parsimonious parsing of comma-separated string array with empties

I have the following string that I'm trying to parse with parsimonious:

(,,"My","Cool",,"Array",,,)

This is a string array where it's possible for each entry to be empty (I want to represent them with None).

I've taken a number of runs at it; the closest I've gotten to something that "works" is this:

string = ~'"[^\"]+"'
comma = ","
array = "(" (comma / string)* ")"

Then I would just use the visitor functions to assemble the array based on what nodes are encountered in the array values. This will work for correctly formatted files, but the following would still be "valid", according to the grammar:

("My""Cool""Array")

In my experience, PEG parsers are a little tricky to work with when parsing arrays like this. Is there a set of rules I could use that would handle this correctly? Ideally I could detect the error during the parse, and not during the AST traversal.

There is 1 answer

InSync On

I don't know Parsimonious, but a PCRE regex that matches the format you want would look like this:

(?(DEFINE)              # Define some reusable groups
  (?<string>"[^"]+")
)

\(                      # Match an opening parenthesis
  (?&string)?           # followed by a string, or empty,
  (?:,(?&string)?)*     # then 0 or more similar constructs preceded by a comma
\)                      # right before a closing parenthesis.

Try it on regex101.com.
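
Python's built-in re module supports neither (?(DEFINE)...) nor (?&name) subroutine calls, so the pattern above can't be pasted into re as-is. Here is a minimal sketch of one way to port it, assuming (as the pattern does) that a string never contains a double quote. The names STRING, ARRAY, TOKEN and parse_array are made up for this illustration. It validates the whole input first, then walks the tokens to build the list, using None for empty entries:

```python
import re

# Hypothetical names; STRING plays the role of the (?<string>...) group above.
STRING = r'"[^"]+"'
ARRAY = re.compile(rf'\((?:{STRING})?(?:,(?:{STRING})?)*\)')
TOKEN = re.compile(r'"([^"]+)"|,')


def parse_array(text):
    """Return the entries as a list, with None for each empty entry."""
    if not ARRAY.fullmatch(text):
        raise ValueError(f"malformed array: {text!r}")
    items, current = [], None
    # The structure is already validated, so each token is a string or a comma.
    for token in TOKEN.finditer(text[1:-1]):
        if token.group(0) == ",":
            items.append(current)     # a comma closes the current entry
            current = None
        else:
            current = token.group(1)  # string contents without the quotes
    items.append(current)             # the entry after the last comma
    return items


print(parse_array('(,,"My","Cool",,"Array",,,)'))
```

Read this way, "()" gives [None] and a trailing comma adds a final None. If you'd rather treat those cases differently, adjust the last append.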

A quick attempt to convert it to a Parsimonious grammar seems to work:

from parsimonious import Grammar

grammar = Grammar('''
  array = "(" string? (comma string?)* ")"
  string = ~'"[^\"]+"'
  comma = ","
''')
grammar.parse('("My","Cool","Array")')        # Pass
grammar.parse('("My","Cool","Array",)')       # Pass
grammar.parse('(,,"My","Cool",,"Array",,,)')  # Pass
grammar.parse('("My""Cool""Array")')          # Error: raises ParseError