I'm trying to write a script that parses a block of HTML and matches words against a given glossary of terms. If it finds a match, it wraps the term in <a class="tooltip"></a> and provides a definition.
It's working okay -- except for two major shortcomings:
- It matches text that is in attributes.
- It matches text that is already inside an <a> tag, creating a nested link.
Is there any way to have my regular expression match only words that are not in attributes, and not in <a> tags?
Here's the code I'm using, in case it's relevant:
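foreach (Glossary::map() as $term => $def) {
    // Escape the term so any regex metacharacters in it are matched literally.
    $search[] = '/\b(' . preg_quote($term, '/') . ')\b/i';
    // Key the lookup by the upper-cased term so the callback can find it case-insensitively.
    self::$lookup[strtoupper($term)] = $def;
}
return preg_replace_callback($search, array($this, 'replace'), $this->content);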
"Don't do that with a regex."
Use an HTML parser, then apply a regex to the contents of HTML elements as it identifies them. That will allow you to easily operate on lots of different variants of HTML structure, valid and otherwise, without a lot of cruft and hard-to-maintain regular expressions.
See also: "Robust and Mature HTML Parser for PHP".
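For what it's worth, here is a minimal sketch of the parser-first approach, using PHP's bundled DOM extension (DOMDocument and DOMXPath). It is not the asker's code: the function name linkGlossaryTerms, the plain array standing in for Glossary::map(), the wrapper div id, and the choice to put the definition in a title attribute are all assumptions made for illustration. The idea is to let XPath pick out only text nodes that are not inside an <a> (or script/style) element, and run the regex on those. Attribute values are never text nodes, so they are skipped automatically.

```php
<?php
// Hypothetical helper: $glossary is a term => definition array,
// standing in for whatever Glossary::map() returns.
function linkGlossaryTerms(string $html, array $glossary): string
{
    if ($glossary === []) {
        return $html;
    }

    // Upper-cased lookup table, as in the question's code.
    $lookup = [];
    foreach ($glossary as $term => $def) {
        $lookup[strtoupper((string) $term)] = $def;
    }

    // One alternation, longest terms first so longer phrases win over their prefixes.
    $terms = array_map('strval', array_keys($glossary));
    usort($terms, fn($a, $b) => strlen($b) - strlen($a));
    $quoted = array_map(fn($t) => preg_quote($t, '/'), $terms);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    // Parse the fragment inside a wrapper div; the XML declaration is a
    // common trick to make libxml read the input as UTF-8.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="glossary-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);

    // Only text nodes that are not inside <a>, <script> or <style>.
    $textNodes = $xpath->query(
        './/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]',
        $root
    );

    foreach (iterator_to_array($textNodes) as $node) {
        // With DELIM_CAPTURE, odd indexes hold the matched terms.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no glossary term in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $link = $doc->createElement('a');
                $link->setAttribute('class', 'tooltip');
                // Assumption: the definition is exposed via the title attribute.
                $link->setAttribute('title', $lookup[strtoupper($part)]);
                $link->appendChild($doc->createTextNode($part));
                $fragment->appendChild($link);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the wrapper's children, not the html/body libxml adds.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the attribute value and the existing link should be left alone.
echo linkGlossaryTerms(
    '<p title="regex">A regex is not a <a href="#">regex parser</a>.</p>',
    ['regex' => 'A pattern describing a set of strings.']
);
```

Because the new elements are built through the DOM rather than by string concatenation, the term and its definition are escaped for you, and the regex itself stays as simple as the one in the question.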