RegEx for matching schema nesting


I need to detect the nesting of one tag in another to raise an error.

Examples:

anything <amb id="1">word1</amb> anything <amb id="2">word2</amb> anything // OK

anything <amb id="1">anything<amb id="2">word2</amb>anything</amb> anything // KO

So I need to detect an <amb... or </amb> tag that appears between an opening <amb... tag and its closing </amb> tag.

I have the beginning of a pattern, but I can't get it to handle the nested tag.

// #\<amb(.*?)\<\/amb\># => OK : detect the first level
$pattern = '#\<amb(?!\<amb)\<\/amb\>#'; // KO

if(preg_match($pattern, $string)) {
  throw new Exception('No nested tags are allowed.');
}

How do I solve this problem?

2 Answers

3
Pushpesh Kumar Rajwanshi (Best Solution)

One way to check for nesting is to look for two consecutive <amb tags with no </amb> tag between them. If that occurs, the tags are nested and you can reject the string. This negative-lookahead regex should do the job:

^(?!.*<amb(?:(?!<\/amb).)+<amb).+$

Regex Demo 1
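Note that this pattern matches strings that are free of nesting, so the condition from the question has to be negated. Below is a minimal sketch of how it might be wired into PHP. The helper name assertNoNestedAmb, the "~" delimiters and the "s" flag are assumptions for illustration, not part of the original answer.

```php
<?php
// The pattern from above, wrapped in "~" delimiters. The "s" flag (an
// assumption) lets "." match newlines so multi-line input is covered too.
const NO_NESTED_AMB = '~^(?!.*<amb(?:(?!<\/amb).)+<amb).+$~s';

// Hypothetical helper: throws when a second <amb opens before a </amb>.
function assertNoNestedAmb(string $string): void
{
    // The pattern ends in ".+", so it never matches an empty string;
    // treat empty input as trivially free of nesting.
    if ($string === '') {
        return;
    }

    // preg_match() returns 1 only when the string contains no nesting.
    // 0 (no match) or false (PCRE error) are both treated as a rejection.
    if (preg_match(NO_NESTED_AMB, $string) !== 1) {
        throw new Exception('No nested tags are allowed.');
    }
}
```

Called with the first example from the question, the helper returns silently; called with the second, it throws the exception.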

Similarly, you can check whether two consecutive </amb> tags appear with no <amb tag between them. That also means the tags are nested, and you can reject the string using this negative-lookahead regex:

^(?!.*<\/amb>(?:(?!<amb).)*<\/amb>).+$

Regex Demo 2

Let me know if this works for you.

1
KIKO Software

You don't need regular expressions for this. They are a pain. What you can do is explode the string on </amb> and then check that each part has at most one <amb in it, like so:

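function correctlyNested($html, $tag)
{
    $tag = strtolower($tag); // match the lowercased input below
    // Split on the closing tag; each part may hold at most one opening tag.
    foreach (explode("</$tag>", strtolower($html)) as $part) {
        if (substr_count($part, "<$tag") > 1) return false; // it is KO
    }
    return true; // it is OK
}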


$tests = ['anything <amb id="1">word1</amb> anything <amb id="2">word2</amb> anything',
          'anything <amb id="1">anything<amb id="2">word2</amb>anything</amb> anything'];

foreach ($tests as $test) {
    echo $test . (correctlyNested($test, "amb") ? " // OK<br>" : " // KO<br>");
}

This code is easy to understand and maintain. I added the strtolower() to show how easy it is to extend this code.