Correcting the regex "\[([a-zA-Z0-9_-]+)]"

The following cl-ppcre regular expression generates an error:

(ppcre:scan-to-strings "\[([a-zA-Z0-9_-]+)]" "[has-instance]")

debugger invoked on a CL-PPCRE:PPCRE-SYNTAX-ERROR in thread
#<THREAD "main thread" RUNNING {10010B0523}>:
  Expected end of string. at position 16 in string "[([a-zA-Z0-9_-]+)]"

What I was expecting as return values is:

"[has-instance]"
#("has-instance")

in order to get at the string within the brackets. Can someone provide a regex correction? Thanks.

There are 2 answers

coredump On BEST ANSWER

Inside a Lisp string literal, a backslash is only needed to escape itself and the double quote. Before any other character the reader simply drops it (§2.4.5 Double-Quote):

If a single escape character is seen, the single escape character is discarded, the next character is accumulated, and accumulation continues.
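
The reader rule quoted above can be checked at the REPL without any library. This is a minimal sketch in plain Common Lisp; the strings are illustrative only:

```lisp
;; What the reader actually stores for each string literal.
(length "\[")        ; => 1, the backslash is discarded
(string= "\[" "[")   ; => T
(length "\\[")       ; => 2, a real backslash followed by [
(char "\\[" 0)       ; => #\\
```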

That means that:

 "\[([a-zA-Z0-9_-]+)]" 

is parsed the same as the following, where backslash is not present:

 "[([a-zA-Z0-9_-]+)]"

The PCRE syntax implemented by CL-PPCRE treats an opening square bracket as the start of a character class, which extends to the next closing bracket. The regex above therefore begins with the following class:

[([a-zA-Z0-9_-]

The corresponding regex tree is:

CL-USER> (ppcre:parse-string "[([a-zA-Z0-9_-]")
(:CHAR-CLASS #\( #\[ (:RANGE #\a #\z) (:RANGE #\A #\Z) (:RANGE #\0 #\9) #\_ #\-)

Note in particular that the opening parenthesis inside it is treated literally. When the parser encounters the closing parenthesis that follows the above fragment, it interprets it as the end of a register group, but no such group was started, hence the error message at position 16 of the string.
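
To inspect the failure without landing in the debugger, the error can be caught and its position read back. This is a sketch that assumes Quicklisp is installed and uses CL-PPCRE's `ppcre-syntax-error-pos` reader on the condition:

```lisp
(ql:quickload :cl-ppcre)   ; assumes Quicklisp is set up

;; The single backslash is dropped by the reader, so CL-PPCRE sees
;; the string "[([a-zA-Z0-9_-]+)]" and fails on the stray ")".
(handler-case (ppcre:parse-string "\[([a-zA-Z0-9_-]+)]")
  (ppcre:ppcre-syntax-error (e)
    ;; Position in the regex string where parsing gave up.
    (ppcre:ppcre-syntax-error-pos e)))
```

Going by the error message in the question, this should return the position of the unmatched closing parenthesis.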

To keep the bracket from starting a character class, the regex must contain a literal backslash before it, as you tried to do. Since the Lisp reader consumes a single backslash inside a string, you have to write two backslash characters:

CL-USER> (ppcre:parse-string "\\[([a-zA-Z0-9_-]+)]")
(:SEQUENCE #\[
 (:REGISTER
  (:GREEDY-REPETITION 1 NIL
   (:CHAR-CLASS (:RANGE #\a #\z) (:RANGE #\A #\Z) (:RANGE #\0 #\9) #\_ #\-)))
 #\])

The closing square bracket needs no backslash.
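
Since the goal is the string within the brackets, the corrected regex can also be used with `ppcre:register-groups-bind`, which binds each register to a variable. This is a sketch that assumes Quicklisp is installed; the function name `bracketed-name` is made up for this example:

```lisp
(ql:quickload :cl-ppcre)   ; assumes Quicklisp is set up

(defun bracketed-name (string)
  "Return the text between [ and ] in STRING, or NIL if nothing matches."
  (ppcre:register-groups-bind (name)
      ("\\[([a-zA-Z0-9_-]+)]" string)
    name))

(bracketed-name "[has-instance]")   ; => "has-instance"
(bracketed-name "no brackets")      ; => NIL
```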

I encourage you to write regular expressions in Lisp using the tree form, with :regex terms when it improves clarity: it avoids having to deal with the kind of problems that escaping brings. For example:

CL-USER> (ppcre:scan-to-strings 
           '(:sequence "[" (:register (:regex "[a-zA-Z0-9_-]+")) "]")
           "[has-instance]")
"[has-instance]"
#("has-instance")
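
If the same pattern is used repeatedly, the tree form can be compiled once with `ppcre:create-scanner`, which accepts parse trees as well as strings. This is a sketch that assumes Quicklisp is installed; the variable name `*bracketed-name-scanner*` is hypothetical:

```lisp
(ql:quickload :cl-ppcre)   ; assumes Quicklisp is set up

(defparameter *bracketed-name-scanner*
  (ppcre:create-scanner
   '(:sequence "[" (:register (:regex "[a-zA-Z0-9_-]+")) "]"))
  "Scanner built once from the parse tree and reused for every call.")

;; The scanner is not anchored, so it finds the bracketed name anywhere.
(ppcre:scan-to-strings *bracketed-name-scanner* "see [has-instance] here")
;; => "[has-instance]", #("has-instance")
```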
Gwang-Jin Kim On
  1. Double-escape the opening square bracket.
  2. You can (double) escape the closing bracket too; CL-PPCRE accepts it either way, but escaping makes the intent explicit.
(cl-ppcre:scan-to-strings "\\[([a-zA-Z0-9_-]+)\\]" "[has-instance]")
;; "[has-instance]" ;
;; #("has-instance")

For those who are new to Common Lisp, you can load cl-ppcre with Quicklisp:

(load "~/quicklisp/setup.lisp") ;; adjust the path to where you installed Quicklisp
(ql:quickload :cl-ppcre)