I'm trying to parse numeric data that includes a less-than symbol, like this: < 0.003. The document to parse contains these original tags:
<TD VALIGN=TOP><P ALIGN=LEFT>
<FONT FACE="Helvetica">
<SPAN STYLE="Font-Size: 12pt"><0.0003</SPAN>
</FONT>
</TD>
The problem is that cheerio handles it as an unclosed HTML tag and closes it, generating a tag named after the value, like this:
<td valign="TOP">
<p align="LEFT">
<font face="Helvetica">
<span style="Font-Size: 12pt">
<0.0003< span=""></0.0003<>
</span>
</font>
</p>
</td>
I've tried the option decodeEntities: true, with no luck.
How can I correctly get the value?
Thanks.
You're seeing this because a bare < is not valid in HTML text content, and that causes a great deal of trouble for cheerio, which expects its input to be well formed. (While you might imagine this should cause browsers to fail to render the page, the "official" HTML specification suggests that browsers should be as forgiving as they can be with "badly formed" input.)
One way to fix this problem would be to run your content through a regular expression, before handing it to cheerio, that converts every bare < (those that are not part of a tag) to its correct HTML entity: '&lt;'.
Adding the /g flag to the regular expression causes the replacement to be applied to every match in the content, not only the first one.
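As a minimal sketch of this idea: the helper name `escapeBareLessThan` and the pattern below are assumptions for illustration, not part of cheerio. The pattern treats a < as "bare" only when it is followed by a number, which is safe because a tag name can never start with a digit; other kinds of stray < would need a different pattern.

```javascript
// Hypothetical helper (not a cheerio API): escape every "<" that is
// followed by a number, optionally with whitespace in between.
function escapeBareLessThan(html) {
  // The lookahead leaves the number itself untouched.
  // The /g flag makes replace() act on every match, not only the first.
  return html.replace(/<(?=\s*\.?\d)/g, '&lt;');
}

const input = '<SPAN STYLE="Font-Size: 12pt"><0.0003</SPAN>';
const output = escapeBareLessThan(input);
console.log(output);
// <SPAN STYLE="Font-Size: 12pt">&lt;0.0003</SPAN>
```

The escaped string can then be passed to cheerio.load() as usual, and reading the span's text should give back the value with the < decoded.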