regex with lookbehind weird behavior

94 views Asked by At

I have been trying to resolve this for the past 2 days...

Please help me in understanding why this is happening. My intention is to just select the <HDR> that has a <DTL1 val="92">.....</HDR>

This is my regular expression

(?<=<HDR>).*?<DTL1\sval="3".*?</HDR>

And the input string is:

<HDR>abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR><HDR><DTL1 val="3"><DTL2 val="4"></HDR>

But this regular expression selects

abc<DTL1 val="1"><DTL2 val="2"></HDR><HDR><DTL1 val="92"><DTL2 val="55"></HDR>

Can anyone please help me?

2

There are 2 answers

4
Adi Inbar On

Casimir et Hippolyte already gave you a couple of good solutions. I want to elaborate on a few things.

First, why your regex fails to do what you want: (?<=<HDR>).*? tells it to match any number of characters starting with the first character preceded by <HDR>, until it encounters what follows the non-greedy quantifier (<DTL1...). Well, the first character that's preceded by <HDR> is the first a, so it matches everything starting from there until the fixed string <DTL1\sval="3" is encountered.

Casimir et Hippolyte's solutions are for the generalized case, where the contents of the <HDR> tags can be anything other than nested <HDR>'s. You could also do it with a positive look-ahead:

(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>

However, if the string is guaranteed to be in the structure shown, where the <HDR> tags only contain one or more <DTL1 val="##"> tags, so you know there won't be any closing tags within, you could do it more efficiently by replacing the first .*? with [^/]*:

(?<=<HDR>)[^/]*<DTL1\sval="3".*?</HDR>

A negated character class is more efficient than a zero-width assertion, and if you're using a negated character class, a greedy quantifier becomes more efficient than a lazy one.

Note also that by using a lookbehind to match the opening <HDR>, you're excluding it from the match, but you're including the closing </HDR>. Are you sure that's what you want? You're matching this...

<DTL1 val="3"><DTL2 val="4"></HDR>

...when presumably you want this...

<HDR><DTL1 val="3"><DTL2 val="4"></HDR>

...or this...

<DTL1 val="3"><DTL2 val="4">

So, in the fist case, don't use a lookbehind for the opening tag:

<HDR>(.(?!</HDR>))*<DTL1\sval="3".*?</HDR>
<HDR>[^/]*<DTL1\sval="3".*?</HDR>

In the second case, use a look-ahead for the closing tag:

(?<=<HDR>)(.(?!</HDR>))*<DTL1\sval="3".*?(?=</HDR>)
(?<=<HDR>)[^/]*<DTL1\sval="3".*?(?=</HDR>)
5
Casimir et Hippolyte On

A regex engine will give you always the leftmost match in a string (even if you use a non-greedy quantifier). This is exactly what you obtain.

So, a solution is to forbid the presence of another <HDR> in the parts described by .*? that is too permissive.

You have two technics to do that, you can replace the .*? with:

(?>[^<]+|<(?!/HDR))*

or with:

(?:(?!</HDR).)*+

Most of the time, the first technic is more performant, but if your string contains an high density of <, the second way can give good results too.

The use of a possessive quantifier or an atomic group can reduce the number of steps to obtain a result in particular when the subpattern fails.

Example:

With the first way:

(?<=<HDR>)(?>[^<]+|<(?!/HDR))*<DTL1\sval="3"(?>[^<]+|<(?!/HDR))*</HDR>

or this variant:

(?<=<HDR>)(?:[^<]+|<(?!/HDR|DTL1))*+<DTL1\sval="3"(?:[^<]+|<(?!/HDR))*+</HDR>

With the second way:

(?<=<HDR>)(?:(?!</HDR).)*<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>

or this variant:

(?<=<HDR>)(?:(?!</HDR|DTL1).)*+<DTL1\sval="3"(?:(?!</HDR).)*+</HDR>