I am using Web-Harvest (http://web-harvest.sourceforge.net/), the open source web scraping tool.
The regex I am trying to use contains "<" and ">" characters (because I am trying to strip out all the HTML tags that come in). This causes a problem because the configuration file is XML, and the parser complains that the content of elements must consist of well-formed character data or markup.
I need to escape the regex somehow, but I can't figure out how.
Any ideas?
To make the regular expression well-formed XML, try replacing < with &lt; and > with &gt;. Similarly, if you have an & in your regular expression, you will need to replace that with &amp;.
Also, I'd suggest you use an HTML parser instead of a regular expression for this task.
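If it helps to see the escaping in action, here is a minimal Python sketch (not Web-Harvest itself, just the Python standard library) showing that an XML parser hands the original pattern back once the special characters are written as entities. The tag-matching pattern, the sample HTML, and the use of a regexp-pattern element as the wrapper are assumptions for illustration only; your actual configuration may look different.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative regex that matches any HTML tag; < and > are special in XML.
pattern = r"<[^>]+>"

# escape() replaces & with &amp;, < with &lt; and > with &gt;.
escaped = escape(pattern)
print(escaped)  # &lt;[^&gt;]+&gt;

# Hypothetical config fragment; the element name is only an example.
config = "<regexp-pattern>" + escaped + "</regexp-pattern>"

# The XML parser decodes the entities, so the regex engine
# receives the original, unescaped pattern.
decoded = ET.fromstring(config).text
print(decoded == pattern)  # True

# Applying the decoded pattern strips the tags from a sample string.
stripped = re.sub(decoded, "", "<p>Hello <b>world</b></p>")
print(stripped)  # Hello world
```

The point is that the escaping only matters to the XML parser: by the time the pattern reaches the regex engine, the entities have already been turned back into the literal characters.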