Matching end-of-line with CL-PPCRE

I have a fairly simple regex that works perfectly in my Ruby code but refuses to work in my Lisp code. I'm just trying to match a URL path: a slash followed by at most one word, and nothing more. Here's the regex that works in Ruby: ^\/\w*$

I'd like this to match "/" or "/foo" but not "/foo/bar".

I've tried the following:

(cl-ppcre:scan "^/\w*$" "/")        ; works
(cl-ppcre:scan "^/\w*$" "/foo")     ; doesn't work!
(cl-ppcre:scan "^/\w*$" "/foo/bar") ; works, i.e. doesn't match

Can someone help?

There are 2 answers

Answer by hans23 (best answer)

In the Lisp reader's standard syntax, the backslash (\) is the single escape character: it prevents any special processing of the character that follows it. That is how you include a double quote (") inside a string literal, like this: "\"".

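Thus, when you pass the literal string "^/\w*$" to cl-ppcre:scan, the string that actually arrives is "^/w*$", i.e. the reader has simply removed the backslash. The regex then means "a slash followed by zero or more w characters", which is why "/" matches but "/foo" does not. You can verify this by evaluating (cl-ppcre:scan "^/\w*$" "/w"), which will match.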

To include the backslash character in your regular expression, you need to escape it with a second backslash, like so: "^/\\w*$".
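
The reader's effect on the two literals can be checked directly at the REPL. The following is a minimal sketch; the ql:quickload line assumes Quicklisp is installed, so substitute whatever loads CL-PPCRE in your setup. Everything else uses only standard string functions and cl-ppcre:scan.

```lisp
;; Assumption: Quicklisp is available; otherwise load CL-PPCRE your usual way.
(ql:quickload "cl-ppcre")

;; The reader drops a single backslash, so these two literals are the same string.
(string= "^/\w*$" "^/w*$")           ; => T

;; With a doubled backslash, one backslash survives in the string.
(length "^/\\w*$")                   ; => 6

;; The regex engine now sees \w, and the three inputs behave as intended.
(cl-ppcre:scan "^/\\w*$" "/")        ; => 0, 1, #(), #()
(cl-ppcre:scan "^/\\w*$" "/foo")     ; => 0, 4, #(), #()
(cl-ppcre:scan "^/\\w*$" "/foo/bar") ; => NIL
```

The first two values returned by scan are the start and end of the match; NIL means no match.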

If you work with literal regular expressions a lot, the required quoting of strings can become tedious and hard to read. Have a look at CL-INTERPOL for a library that adds a nicer syntax for regular expressions to the Lisp reader.

Answer by coredump

If you are in doubt about your regular expression, you can also check how it is parsed with ppcre:parse-string:

CL-USER> (ppcre:parse-string "^/\w*$")
(:SEQUENCE :START-ANCHOR #\/ (:GREEDY-REPETITION 0 NIL #\w) :END-ANCHOR)

The returned value is a tree that represents a regular expression. You can in fact use the same representation anywhere CL-PPCRE expects a regular expression. The above tells us that backslash-w was interpreted as a literal w character.

Compare this with the expression you wanted to use:

CL-USER> (ppcre:parse-string "^/\\w*$")
(:SEQUENCE 
  :START-ANCHOR #\/ 
  (:GREEDY-REPETITION 0 NIL :WORD-CHAR-CLASS)
  :END-ANCHOR)
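
Because the tree form is accepted wherever a regex string is, the tree above can be passed to scan as is. This is a small sketch; the variable name *path-segment* is made up for illustration, and the ql:quickload line assumes Quicklisp.

```lisp
;; Assumption: Quicklisp is available; otherwise load CL-PPCRE your usual way.
(ql:quickload "cl-ppcre")

;; The tree printed by parse-string, written by hand. No string escaping involved.
;; *path-segment* is a hypothetical name, not part of CL-PPCRE.
(defparameter *path-segment*
  '(:sequence
    :start-anchor #\/
    (:greedy-repetition 0 nil :word-char-class)
    :end-anchor))

(ppcre:scan *path-segment* "/")         ; => 0, 1, #(), #()
(ppcre:scan *path-segment* "/foo")      ; => 0, 4, #(), #()
(ppcre:scan *path-segment* "/foo/bar")  ; => NIL
```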

Even though it is somewhat verbose, the tree representation helps when combining values into regexes, because you do not have to worry about nesting strings or escaping special characters inside strings. For example, here the regular expression is computed by a function before being used, with no escaping at all:

(defun maybe (regex)
  `(:greedy-repetition 0 1 ,regex))

(defparameter *simple-floats*
  (let ((digits '(:register (:greedy-repetition 1 nil :digit-class))))
    (ppcre:create-scanner `(:sequence
                             (:register (:regex "[+-]?"))
                             ,digits
                             ,(maybe `(:sequence "." ,digits))))))

In the example above, the string "." is matched literally; it is not treated as the regex metacharacter that matches any character. That means you can match strings like "(^.^)" or "[]", which are hard to write and read with escaped characters in string-only regexes. You can fall back to string syntax inside a tree by using the (:regex "...") expression.
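
To make the difference concrete, here is a short sketch comparing a literal string inside a tree with the equivalent string regex. The inputs are chosen only for illustration, and the ql:quickload line assumes Quicklisp.

```lisp
;; Assumption: Quicklisp is available; otherwise load CL-PPCRE your usual way.
(ql:quickload "cl-ppcre")

;; Strings inside a parse tree are literal text: ( ^ . ) need no escaping.
(ppcre:scan '(:sequence :start-anchor "(^.^)" :end-anchor) "(^.^)") ; => 0, 5, #(), #()
(ppcre:scan '(:sequence :start-anchor "(^.^)" :end-anchor) "(^x^)") ; => NIL

;; The string-regex equivalent escapes every metacharacter, and each
;; backslash is doubled for the Lisp reader.
(ppcre:scan "^\\(\\^\\.\\^\\)$" "(^.^)")                            ; => 0, 5, #(), #()

;; (:regex "...") embeds string syntax where that is more convenient.
(ppcre:scan '(:sequence "[" (:regex "\\d+") "]") "x[42]")           ; => 1, 5, #(), #()
```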

CL-PPCRE has an optimization where constant regular expressions are precomputed at load time, using load-time-value. That optimization might not be applied if your regular expression is not trivially constant, so you may want to wrap your own scanners in load-time-value forms. Just make sure that every definition the form needs, like the auxiliary maybe function, is available at load time.
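
A minimal sketch of such a wrapper follows. The function name simple-integer-p is hypothetical and the ql:quickload line assumes Quicklisp. The tree is built with a backquote, so it is not a literal constant, and the form relies only on definitions local to it.

```lisp
;; Assumption: Quicklisp is available; otherwise load CL-PPCRE your usual way.
(ql:quickload "cl-ppcre")

;; Hypothetical predicate. The load-time-value form is evaluated once, when
;; the code is loaded (or compiled at the REPL), not on every call.
(defun simple-integer-p (string)
  (ppcre:scan (load-time-value
               (let ((digits '(:greedy-repetition 1 nil :digit-class)))
                 (ppcre:create-scanner
                  `(:sequence :start-anchor
                              (:regex "[+-]?")
                              ,digits
                              :end-anchor))))
              string))

(simple-integer-p "-42")  ; => 0, 3, #(), #()
(simple-integer-p "4.2")  ; => NIL
```

Keeping the helper data inside the load-time-value form sidesteps the question of whether outside definitions exist yet when the form is evaluated.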