Parsing a less-than symbol with Node.js and cheerio

I'm trying to parse numeric data that includes a less-than symbol, like this: <0.0003. The original tags in the document being parsed are:

<TD VALIGN=TOP><P ALIGN=LEFT>
  <FONT FACE="Helvetica">
    <SPAN STYLE="Font-Size: 12pt"><0.0003</SPAN>
  </FONT>
</TD>

The problem is that cheerio treats it as an unclosed HTML tag and closes it, generating a tag whose name is the value, like this:

<td valign="TOP">
    <p align="LEFT">
        <font face="Helvetica">
            <span style="Font-Size: 12pt">
                <0.0003< span=""></0.0003<>
            </span>
        </font>
    </p>
</td>

I've tried the option decodeEntities: true, with no luck.

How can I correctly get the value?

Thanks.

There are 2 answers

Rob Raisch

You're seeing this because a bare < is not a legal character in HTML text, and that causes real trouble for cheerio, which expects its input to be well-formed.

(You might imagine this should cause browsers to fail to render the page, but the "official" HTML specification suggests that browsers should be as forgiving as they can be with "badly formed" input.)

One way to fix this problem is to run your content through a regular expression that converts every bare < (one that appears before another < or the end of the content, with no > in between) to its correct HTML entity, '&lt;', as in:

// assumes the page content is kept in this var.
var content = '<p>1 < 2<br/>3</p>2 < 3'; 

content=content.replace(/<([^<>]+(?:<|$))/g,'&lt;$1');

will produce:

<p>1 &lt; 2<br/>3</p>2 &lt; 3

The regular expression is:

<            -- literal '<'
(            -- start capture group #1
  [^<>]+     -- one or more characters except literal '<' or '>'
  (?:        -- start non-capturing group
    <        -- literal '<'
  |          -- OR
    $        -- end of content
  )          -- end of non-capturing group
)            -- end of capture group #1

Adding the /g flag to the regular expression makes replace() substitute every match in the content rather than only the first one.
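As a possible variation on the expression above, here is a minimal sketch that moves the "followed by another < or the end of content" check into a lookahead, so the following < is not consumed and two bare < in a row (as in '1 < 2 < 3') are both escaped. The function name escapeBareLt is hypothetical and not part of cheerio, and the approach is still a heuristic: it assumes no < appears inside attribute values or script blocks.

```javascript
// Hypothetical helper (not part of cheerio): escape every '<' that does not
// start a tag, i.e. one that reaches another '<' or the end of the input
// before any '>' appears.
function escapeBareLt(html) {
  // The lookahead inspects the following text without consuming it, so
  // consecutive bare '<' characters are each matched and replaced.
  return html.replace(/<(?=[^<>]*(?:<|$))/g, '&lt;');
}

// The cell from the question, reduced to the SPAN that holds the value.
var cell = '<SPAN STYLE="Font-Size: 12pt"><0.0003</SPAN>';
var escapedCell = escapeBareLt(cell);
var escapedChain = escapeBareLt('1 < 2 < 3');

console.log(escapedCell);
console.log(escapedChain);
```

Passing the escaped string to cheerio.load() should then leave the SPAN intact, with the value available as its text.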

karthick

There is an open issue on GitHub about a custom option for parsing the HTML. It is expected in a future release.

Workaround:

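var cheerio = require('cheerio');

// The input uses the question's '<0.0003' value so the bogus tag shows up among the names.
var $ = cheerio.load('<TD VALIGN=TOP>kk<P ALIGN=LEFT><FONT FACE="Helvetica"><SPAN STYLE="Font-Size: 12pt"><0.0003</SPAN></FONT></TD>');

// Log the name of every element that cheerio created.
$("*").each(function(i, elem) {
    console.log($(elem)[0].name);
});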

$(elem)[0].name logs every tag name. You can use a regex to filter out the number.

Note: The problem occurs only for the < symbol. A > symbol is parsed as &gt;.
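The "use a regex to filter out the number" step could look roughly like the sketch below. It assumes the bogus tag name begins with the numeric value (the question's output suggests a name like '0.0003<'), which works because real HTML tag names never start with a digit. The helper name numberFromTagNames and the sample list of names are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: return the first numeric value found at the start of
// a tag name, or null when no name starts with a number.
function numberFromTagNames(names) {
  for (var i = 0; i < names.length; i++) {
    // Optional sign, digits, optional fractional part, anchored at the start
    // so that real tags such as 'h1' are not mistaken for values.
    var match = /^-?\d+(?:\.\d+)?/.exec(names[i]);
    if (match) {
      return parseFloat(match[0]);
    }
  }
  return null;
}

// Assumed input: tag names collected from the parsed document, including
// the bogus tag created from '<0.0003'.
var names = ['td', 'p', 'font', 'span', '0.0003<'];
var value = numberFromTagNames(names);

console.log(value);
```

This recovers the number but not the less-than sign itself, so the caller has to remember that the value was originally written as an upper bound.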