Regular expression with negative lookahead and negative lookbehind to check if my match IS NOT between [[ and ]]


I'm trying to write a Python script that replaces each occurrence of given keywords in a given Markdown file with the keyword itself wrapped in [[ and ]].

It will be used several times on the same files, so I don't want to end with, for instance, FOO becoming [[FOO]], then [[[[FOO]]]] etc.

So FOO must not be wrapped again when it already sits between [[ and ]].

The closest version I came up with is this: (?<!\[\[)\b(FOO)\b(?!\]\])

The status of my test list is:

Should     match : lorem ipsum FOO dolor              ==> OK
Should NOT match : lorem ipsum [[FOO]]  dolor         ==> OK
Should NOT match : lorem [[ipsum FOO dolor]] sit amet ==> Not OK
Should NOT match : lorem [[ipsumFOOsolor]] sit amet   ==> OK
Should NOT match : [[lorem]]  [[ipsum-FOO&dolor-sit.pdf#page=130]] ==> Not OK

For reference, I would like to use this regex in the following Python snippet:

    for term in term_list:
        pattern = r'(?<!\[\[)\b(' + re.escape(term) + r')\b(?!\]\])'
        file_content = re.sub(pattern, r'[[\1]]', file_content)

What could be the regexp I need? What is wrong with this approach?

Thanks!

Answer by The fourth bird (best answer)

What you might do, not taking nested [[..[[..]]..]] into account, is to get the [[...]] part out of the way and capture what you want to keep in a group.

Then use that group in the replacement, and leave the part that is only matched (not in the group) untouched.

You can see the regex matches here.

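This part of the pattern, (?:(?!\[\[|]]).)*, matches any run of characters in which no position is the start of [[ or ]]. It consumes one character at a time, first checking that neither [[ nor ]] begins at that position.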

import re

# Alternative 1 consumes a whole [[...]] span without capturing anything;
# alternative 2 captures a bare FOO in group 1.
pattern = r"\[\[(?:(?!\[\[|]]).)*\]\]|\b(FOO)\b"

s = ("lorem ipsum FOO dolor\n"
     "Should NOT match : lorem ipsum [[FOO]]  dolor\n"
     "Should NOT match : lorem [[ipsum FOO dolor]] sit amet\n"
     "Should NOT match : lorem [[ipsumFOOsolor]] sit amet\n"
     "Should NOT match : [[lorem]]  [[ipsum-FOO&dolor-sit.pdf#page=130]]")

# Wrap only when group 1 took part in the match; otherwise return the match unchanged.
result = re.sub(pattern, lambda x: f"[[{x.group(1)}]]" if x.group(1) else x.group(), s)
print(result)

Output

lorem ipsum [[FOO]] dolor
Should NOT match : lorem ipsum [[FOO]]  dolor
Should NOT match : lorem [[ipsum FOO dolor]] sit amet
Should NOT match : lorem [[ipsumFOOsolor]] sit amet
Should NOT match : [[lorem]]  [[ipsum-FOO&dolor-sit.pdf#page=130]]
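
The same idea can be extended to the list of terms from the question. The following is only a sketch: the function name wrap_terms and the choice to combine all terms into one alternation are assumptions made for illustration, not part of the answer above. It also assumes that every term starts and ends with a word character, since \b behaves differently otherwise.

```python
import re

def wrap_terms(text, terms):
    """Wrap each whole-word term in [[...]] unless it is already inside [[...]].

    Hypothetical helper; assumes terms start and end with a word character.
    """
    # Longest terms first, so a longer term wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # Branch 1 consumes an existing [[...]] span; branch 2 captures a bare term.
    pattern = re.compile(
        r"\[\[(?:(?!\[\[|\]\]).)*\]\]|\b(" + alternation + r")\b"
    )
    return pattern.sub(
        lambda m: f"[[{m.group(1)}]]" if m.group(1) else m.group(), text
    )

text = "lorem FOO [[ipsum FOO dolor]] BAR"
once = wrap_terms(text, ["FOO", "BAR"])
twice = wrap_terms(once, ["FOO", "BAR"])
print(once)
print(once == twice)
```

Running the function a second time on its own output should leave the text unchanged, which is the property the question asks for when the script is applied repeatedly to the same files.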
Answer by Cary Swoveland

I assume the following conditions are satisfied, as they are for all the examples given in the question.

  • Single brackets are not present in the string. That is, any opening bracket ("[") must be preceded or followed by exactly one opening bracket. The same holds for closing brackets ("]").
  • Double-brackets are always balanced, e.g., there is no "[[...]]...[[...".
  • Double-brackets are not nested, e.g., there is no "...[[...[[...]]...FOO...]]...".
  • The keyword string cannot be immediately preceded or followed by a word character (e.g., no "...catFOO dog...", "...cat FOOdog..." or "...catFOOdog...").

For example, "FOO" should be matched in these three strings:

lorem ipsum FOO dolor
lorem ipsum FOO dolor [[amet]]
[[lorem]] ipsum FOO dolor [[sit]] amet

but not in these six:

lorem ipsum [[FOO]] dolor
lorem [[ipsum FOO dolor]] sit amet
lorem [[ipsumFOOsolor]] sit amet
[[lorem]]
[[ipsum-FOO&dolor-sit.pdf#page=130]]
lorem ipsumFOO solor sit amet

If the conditions above are satisfied we may conclude that the keyword string is not within a double-bracket-delimited string if the following regular expression is matched.

\bFOO\b(?![^[\]]*(?:\[\[[^[\]]*]][^[\]]*)*]])

The idea is simple: given the assumptions above, in particular the one about double-brackets being balanced, if FOO is followed later in the string by "]]", possibly with intervening clauses of the form [[...]], then FOO must be preceded in the string by "[[", possibly with intervening clauses of the form [[...]], meaning that "FOO" is within a clause of the form [[...FOO...]].

Demo (see note 1).

Note that "FOO" should be matched in the following string, but it is not, because the double-brackets there violate the assumption that they are balanced.

lorem FOO [[ipsum solor]] sit ]] amet
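
The examples above can be checked with a short Python script. This is a sketch rather than part of the original demo: the expression is the one given above, except that every bracket is escaped for readability, and the variable names are chosen for illustration.

```python
import re

# The expression from this answer, with all brackets escaped.
PATTERN = re.compile(r"\bFOO\b(?![^\[\]]*(?:\[\[[^\[\]]*\]\][^\[\]]*)*\]\])")

should_match = [
    "lorem ipsum FOO dolor",
    "lorem ipsum FOO dolor [[amet]]",
    "[[lorem]] ipsum FOO dolor [[sit]] amet",
]
should_not_match = [
    "lorem ipsum [[FOO]] dolor",
    "lorem [[ipsum FOO dolor]] sit amet",
    "lorem [[ipsumFOOsolor]] sit amet",
    "[[lorem]]",
    "[[ipsum-FOO&dolor-sit.pdf#page=130]]",
    "lorem ipsumFOO solor sit amet",
]
# Unbalanced brackets: FOO is outside any [[...]] but is still rejected.
unbalanced = "lorem FOO [[ipsum solor]] sit ]] amet"

matched = [PATTERN.search(s) is not None for s in should_match]
rejected = [PATTERN.search(s) is None for s in should_not_match]
unbalanced_rejected = PATTERN.search(unbalanced) is None
print(all(matched), all(rejected), unbalanced_rejected)  # True True True
```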

The regular expression can be broken down as follows. (One may hover over each element of the expression at the link to obtain an explanation of its function. To be clear, one should hover the cursor, not one's person.)

\bFOO\b       match the literal 'FOO' with pre- and post- word boundaries
(?!           begin a negative lookahead
  [^[\]]*     match >= 0 (`*`) characters other than '[' and ']'
  (?:         begin a non-capture group
    \[\[      match the literal '[['
    [^[\]]*   match >= 0 (`*`) characters other than '[' and ']'
    ]]        match the literal ']]'
    [^[\]]*   match >= 0 (`*`) characters other than '[' and ']'
  )*          end the non-capture group and execute it >= 0 times
  ]]          match the literal ']]'
)             end the negative lookahead
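
The breakdown above maps directly onto Python's re.VERBOSE mode, which ignores whitespace and allows comments inside the pattern. The sketch below assumes the keyword is the fixed string FOO and escapes all brackets; the name NOT_IN_BRACKETS is made up for illustration.

```python
import re

NOT_IN_BRACKETS = re.compile(r"""
    \bFOO\b                   # the keyword as a whole word
    (?!                       # negative lookahead: FOO must not be followed by
      [^\[\]]*                #   a run of non-bracket characters,
      (?:                     #   then zero or more repetitions of
        \[\[ [^\[\]]* \]\]    #     one complete double-bracket clause
        [^\[\]]*              #     and another run of non-bracket characters,
      )*
      \]\]                    #   and finally a closing double bracket
    )
""", re.VERBOSE)

text = "lorem FOO [[ipsum FOO dolor]] FOO"
# \g<0> refers to the whole match, so no capture group is needed.
wrapped = NOT_IN_BRACKETS.sub(r"[[\g<0>]]", text)
print(wrapped)
```

Because a wrapped keyword is immediately followed by "]]", applying the substitution to its own output should change nothing further.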

1. I added a newline (\n) to the regular expression at the link in order to test multiple strings.