Escaping quotes in cl-ppcre regex


Background

I need to parse CSV files, and cl-csv et al. are too slow on large files and depend on cl-unicode, which my preferred Lisp implementation does not support. So I am improving cl-simple-table, which Sabra-on-the-hill benchmarked as the fastest CSV reader in a review.

At the moment, simple-table's line parser is rather fragile, and it breaks if the separator character appears within a quoted string. I'm trying to replace the line parser with cl-ppcre.

Attempts

Using the Regex Coach, I've found a regex that works in almost all cases:

("[^"]+"|[^,]+)(?:,\s*)?

The challenge is getting this Perl regex string into something I can use in cl-ppcre to split the line. I have tried passing the regex string, with various escapes for the ":

(defparameter bads "\"AER\",\"BenderlyZwick\",\"Benderly and Zwick Data: Inflation, Growth and Stock returns\",31,5,0,0,0,0,5,\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\",\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\"
"Bad string, note a separator character in the quoted field, near Inflation")

(ppcre:split "(\"[^\"]+\"|[^,]+)(?:,\s*)?" bads)
NIL

Neither a single, double, triple nor quadruple \ works.

I've parsed the string to see what the parse tree looks like:

(ppcre:parse-string "(\"[^\"]+\"|[^,]+)(?:,s*)?")
(:SEQUENCE (:REGISTER (:ALTERNATION (:SEQUENCE #\" (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\")) #\") (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,)))) (:GREEDY-REPETITION 0 1 (:GROUP (:SEQUENCE #\, (:GREEDY-REPETITION 0 NIL #\s)))))

and passed the resulting tree to split:

(ppcre:split '(:SEQUENCE (:REGISTER (:ALTERNATION (:SEQUENCE #\" (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\")) #\") (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,)))) (:GREEDY-REPETITION 0 1 (:GROUP (:SEQUENCE #\, (:GREEDY-REPETITION 0 NIL #\s))))) bads)
NIL

I also tried various forms of *allow-quoting*:

 (let ((ppcre:*allow-quoting* t))
  (ppcre:split "(\\Q\"\\E[^\\Q\"\\E]+\\Q\"\\E|[^,]+)(?:,\s*)?" bads))

I've read through the cl-ppcre docs, but there are very few examples of using parse trees, and no examples of escaping quotes.

Nothing seems to work.

I was hoping that the Regex Coach would provide a way to see the S-expression parse tree form of the Perl syntax string. That would be a very useful feature, allowing you to experiment with the regex string and then copy & paste the parse tree in Lisp code.

Does anyone know how to escape quotes in this example?

Answer by coredump (best answer)

In this answer I focus on the errors in your code and try to explain how you could make it work. As explained by @Svante, this might not be the best course of action for your use case. In particular, your regex might be too tailored to your known test inputs and might miss cases that arise later.

For example, your regex considers a field to be either a string delimited by double quotes with no inner double quotes (not even escaped ones), or a sequence of characters other than the comma. If, however, a field starts with an ordinary letter and then contains a double quote, that quote becomes part of the field's content and does not protect any comma that follows it.
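To make that limitation concrete, here is a small sketch. It assumes cl-ppcre is already loaded and uses a made-up input line, `ab"c,d",e`, in which a quote appears in the middle of a field. The pattern is the one from the question, written as a Lisp string with the backslash of `\s` doubled.

```lisp
;; Assumes cl-ppcre is already loaded, e.g. with (ql:quickload "cl-ppcre").
;; The input line is a made-up example, not taken from the question's data.
(ppcre:all-matches-as-strings
 "(\"[^\"]+\"|[^,]+)(?:,\\s*)?"
 "ab\"c,d\",e")
;; => ("ab\"c," "d\"," "e")
```

The comma between the two quotes is treated as a separator, because the quoted branch of the alternation can only match when the field starts with a double quote.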

Fixing the test string

Maybe there was a problem when formatting your question, but the form introducing bads is malformed: the closing double quote of the value string is missing. Here is a fixed definition, named *bads*. The asterisks around the name (also known as "earmuffs") are a useful convention for special variables that helps distinguish them from lexical variables:

(defparameter *bads*
  "\"AER\",\"BenderlyZwick\",\"Benderly and Zwick Data: Inflation, Growth and Stock returns\",31,5,0,0,0,0,5,\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\",\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\"")

Escape characters in regex

The parse tree you obtain contains this:

(... (:GREEDY-REPETITION 0 NIL #\s) ...)

There is a literal character #\s in your parse-tree. To understand why, let's define two auxiliary functions:

(defun chars (string)
  "Convert a string to a list of char names"
  (map 'list #'char-name string))

(defun test (s)
  (list :parse (chars s)
        :as (ppcre:parse-string s)))

For example, here is how the different strings below are parsed:

(test "s")
=> (:PARSE ("LATIN_SMALL_LETTER_S") :AS #\s)

(test "\s")
=> (:PARSE ("LATIN_SMALL_LETTER_S") :AS #\s)

(test "\\s")
=> (:PARSE ("REVERSE_SOLIDUS" "LATIN_SMALL_LETTER_S")
    :AS :WHITESPACE-CHAR-CLASS)

Only in the last case, where the backslash (reverse solidus) is itself escaped, does the PPCRE parser see both the backslash and the following character #\s, and interpret the sequence as :WHITESPACE-CHAR-CLASS. In the second case the Lisp reader has already turned \s into a plain s before PPCRE sees the string: inside a Lisp string literal, a backslash simply quotes the next character and is then dropped, so the only way to get a real backslash into the string is to write \\.

I tend to work with parse trees directly, because a lot of headaches w.r.t. escaping go away (in my opinion, \Q and \E only make those headaches worse). A fixed parse tree is, for example, the following one, where I replaced the #\s with the desired keyword and removed the :register node, which is not needed here:

 (:sequence
   (:alternation
    (:sequence #\"
     (:greedy-repetition 1 nil
      (:inverted-char-class #\"))
     #\")
    (:greedy-repetition 1 nil (:inverted-char-class #\,)))
   (:greedy-repetition 0 1
    (:group
     (:sequence #\,
      (:greedy-repetition 0 nil :whitespace-char-class)))))
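If you would rather keep the regex as a string, the equivalent fix is to double the backslash of `\s` while escaping each double quote once. A minimal sketch, assuming cl-ppcre is already loaded and using a made-up test string (`"a, b",  c`), with the original register kept so that it shows up in the second return value:

```lisp
;; Assumes cl-ppcre is already loaded, e.g. with (ql:quickload "cl-ppcre").
;; \" puts a double quote into the Lisp string; \\s puts backslash + s into it.
(ppcre:scan-to-strings "(\"[^\"]+\"|[^,]+)(?:,\\s*)?"
                       "\"a, b\",  c")
;; => "\"a, b\",  "
;;    #("\"a, b\"")
```

The first value is the whole match (the field, the comma and the spaces after it) and the second is the vector of registers, here just the quoted field.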

Why the result is NIL

Remember that you are trying to split the string with this regex, but the regex describes a field together with the comma that follows it, not a separator. The result is NIL because, from the point of view of split, your whole string is just a sequence of separators with nothing in between, and the empty strings that remain are dropped, as in this example:

(split #\, ",,,,,,")
NIL

With a simpler example, you can see that using runs of letters as separators gives:

(split "[a-z]+" "abc0def1z3")
=> ("" "0" "1" "3")

But if the separators also include digits, then the result is NIL:

(split "[a-z0-9]+" "abc0def1z3")
=> NIL
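One way to see what split treats as separators is to ask for the matches themselves. This is only an illustrative sketch, assuming cl-ppcre is already loaded; `all-matches-as-strings` returns the substrings that the regex consumes:

```lisp
;; Assumes cl-ppcre is already loaded, e.g. with (ql:quickload "cl-ppcre").
(ppcre:all-matches-as-strings "[a-z]+" "abc0def1z3")
;; => ("abc" "def" "z")

(ppcre:all-matches-as-strings "[a-z0-9]+" "abc0def1z3")
;; => ("abc0def1z3")
```

In the second call a single match covers the entire string, so nothing is left between separators for split to return.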

Looping over fields

With the regex you defined, it is easier to use do-register-groups. It is a loop construct that iterates over the string by trying to match the regex successively on the string, binding each (:register ...) in the regex to a variable.

If you put (:register ...) around the first (:alternation ...), you will sometimes capture the double quotes (first branch of the alternation):

(do-register-groups (field)
    ('(:SEQUENCE
       (:register
        (:ALTERNATION
         (:SEQUENCE #\"
          (:GREEDY-REPETITION 1 NIL
           (:INVERTED-CHAR-CLASS #\"))
          #\")
         (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,))))
       (:GREEDY-REPETITION 0 1
        (:GROUP
         (:SEQUENCE #\,
          (:GREEDY-REPETITION 0 NIL :whitespace-char-class)))))
     *bads*)
  (print field))

"\"AER\"" 
"\"BenderlyZwick\"" 
"\"Benderly and Zwick Data: Inflation, Growth and Stock returns\"" 
"31" 
"5" 
"0" 
"0" 
"0" 
"0" 
"5" 
"\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\"" 
"\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\"" 

Another option is to add two :register nodes, one for each branch of the alternation; that means binding two variables, one of them being NIL for each successful match:

(do-register-groups (quoted simple)
    ('(:SEQUENCE
       (:ALTERNATION
        (:SEQUENCE #\"
         (:register ;; <- quoted (first register)
          (:GREEDY-REPETITION 1 NIL
           (:INVERTED-CHAR-CLASS #\")))
         #\")
        (:register ;; <- simple (second register)
         (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,))))
       (:GREEDY-REPETITION 0 1
        (:GROUP
         (:SEQUENCE #\,
          (:GREEDY-REPETITION 0 NIL :whitespace-char-class)))))
     *bads*)
  (print (or quoted simple)))

"AER" 
"BenderlyZwick" 
"Benderly and Zwick Data: Inflation, Growth and Stock returns" 
"31" 
"5" 
"0" 
"0" 
"0" 
"0" 
"5" 
"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv" 
"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html" 

Inside the loop you could push each field into a list or a vector to be processed later.
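As a rough sketch of that last step, the loop can be wrapped in a function that collects the fields into a list. The name `split-csv-line` is hypothetical (it is not part of cl-ppcre or cl-simple-table), cl-ppcre is assumed to be loaded, and the regex is the two-register variant written as a string instead of a parse tree:

```lisp
;; Assumes cl-ppcre is already loaded, e.g. with (ql:quickload "cl-ppcre").
;; SPLIT-CSV-LINE is a hypothetical helper name used only for illustration.
(defun split-csv-line (line)
  "Return the fields of LINE as a list of strings, without surrounding quotes.
Empty fields and escaped quotes inside quoted fields are not handled."
  (let ((fields '()))
    (ppcre:do-register-groups (quoted simple)
        ("(?:\"([^\"]+)\"|([^,]+))(?:,\\s*)?" line)
      ;; For each match, one of QUOTED and SIMPLE is NIL.
      (push (or quoted simple) fields))
    ;; PUSH builds the list backwards, so restore the original order.
    (nreverse fields)))

(split-csv-line "\"AER\",\"Inflation, Growth\",31,5")
;; => ("AER" "Inflation, Growth" "31" "5")
```

Because both branches of the alternation require at least one character, an empty field (two commas in a row) is silently skipped, which is one more reason to treat this as a starting point and not a complete CSV parser.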