What is wrong with my word boundary regex?


I have the following little Python script:

import re

def main():
    thename = "DAVID M. D.D.S."
    theregex = re.compile(r"\bD\.D\.S\.\b")
    # Call search() on the compiled pattern directly
    if theregex.search(thename):
        print("you did it")

main()

It's not matching. But if I adjust the regex slightly and remove the final \. it does work, like this:

\bD\.D\.S\b

I feel I'm pretty good at understanding regexes, but this has me baffled. My understanding is that \b (word boundary) is a zero-width match at a non-alphanumeric (and non-underscore) character. So I would expect

"\bD\.D\.S\.\b"

to match:

D.D.S.

What am I missing?


There are 2 answers

4
Adam Katz (BEST ANSWER)

This doesn't do what you might think it does.

r"\bD\.D\.S\.\b"

Here is an explanation of that regex, with the same examples that are listed below:

D.D.S.   # no match, as there is no word boundary after the final dot
D.D.S.S  # matches since there is a word boundary between `.` and `S` at the end

Word boundaries are zero-width matchers between a word character (\w, which is [0-9A-Za-z_] plus, depending on the regex flavor and flags, other letters and digits; in Python 3, str patterns use Unicode word characters by default) and a non-word character (\W, the inverse of that class) or the start or end of the string. Dot (.) is not a word character, so "D.D.S. " (note the trailing whitespace) has word boundaries (only!) in the following places: \bD\b.\bD\b.\bS\b. (I didn't escape the dots because I'm illustrating the word boundaries, not writing a regular expression.) There is no boundary after the final dot because it is followed by a space, and both are non-word characters.
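To make the boundary positions concrete, here is a minimal Python sketch. The helper name boundary_positions is made up for illustration; it lists every index where \b matches, then runs the original pattern against the two sample strings discussed above.

```python
import re

def boundary_positions(text):
    # Collect the index of every zero-width \b match in text
    return [m.start() for m in re.finditer(r"\b", text)]

# The pattern from the question, unchanged
original = re.compile(r"\bD\.D\.S\.\b")

positions = boundary_positions("D.D.S. ")
print(positions)  # [0, 1, 2, 3, 4, 5]: nothing after the final dot

# No boundary between the last "." and the end of the string
print(original.search("DAVID M. D.D.S."))  # None

# A boundary exists between "." and the trailing "S"
print(original.search("D.D.S.S") is not None)  # True
```

Index 6 (between the final dot and the space) is absent from the list, which is exactly why the trailing \b in the question's pattern fails.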

I assume you are trying to match an end of line or whitespace after the final dot. There are two ways to do that:

r"\bD\.D\.S\.(?!\S)"   # by negation: do not match a non-whitespace
r"\bD\.D\.S\.(?:\s|$)" # match either a whitespace character or end of line

I've refined the above regex explanation link to explain the negation example above (note the first ends in …/1 while the second ends in …/2; feel free to further experiment there, it is nice and interactive).
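As a quick check, the two alternatives can be compared side by side in Python. This is only a sketch: the sample strings and the helper matched_text are assumptions added for illustration, not part of the original answer.

```python
import re

# Negative lookahead: asserts that no non-whitespace follows, consumes nothing
lookahead = re.compile(r"\bD\.D\.S\.(?!\S)")
# Alternation: consumes one whitespace character, or matches at the end
consuming = re.compile(r"\bD\.D\.S\.(?:\s|$)")

def matched_text(pattern, text):
    # Return the matched substring, or None when there is no match
    m = pattern.search(text)
    return m.group() if m else None

samples = ["DAVID M. D.D.S.", "D.D.S. rest", "D.D.S.S"]
for s in samples:
    print(repr(s), repr(matched_text(lookahead, s)), repr(matched_text(consuming, s)))
```

Both patterns accept the first two samples and reject "D.D.S.S". The practical difference is that the second one includes the trailing whitespace character in the match ("D.D.S. " for the second sample), while the lookahead version leaves it unconsumed.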

1
codeonly
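  • \.\b matches the dot in .bla: it requires a word character right after the dot
  • \.\B is the opposite: it matches the dot in bla. but not in bla.bla, because it requires a non-word character (or the end of the string) right after the dot
So end the pattern with \B instead of \b:
\bD\.D\.S\.\B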
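A short Python sketch of the \B variant, using sample strings chosen for illustration (they are assumptions, not from the answer). Note that, unlike the whitespace-based alternatives, \B also accepts punctuation after the final dot.

```python
import re

non_boundary = re.compile(r"\bD\.D\.S\.\B")

# Final dot followed by the end of the string: \B holds, so this matches
print(non_boundary.search("DAVID M. D.D.S.") is not None)  # True

# Final dot followed by a word character: \B fails
print(non_boundary.search("D.D.S.S"))  # None

# Final dot followed by a comma (also a non-word character): still matches
print(non_boundary.search("D.D.S., Jr.") is not None)  # True

# The two bullet points, checked directly
print(re.search(r"\.\b", ".bla") is not None)  # True
print(re.search(r"\.\B", "bla.") is not None)  # True
print(re.search(r"\.\B", "bla.bla"))           # None
```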