Getting back additional info when web scraping with cheerio.js


I am working with cheerio.js to make a simple web scraper. For some reason it does not respond to certain HTML tags. One element I cannot target is the div with the class 'dataTables_scrollBody' on the website that I am scraping: http://www.caffeineinformer.com/the-caffeine-database.

However, I think I found a work-around to my problem.

I read through the documentation at https://github.com/cheeriojs/cheerio and am following the format $( selector, [context], [root] ).

$(".main, div:nth-child(3) ").filter(function(){
        var data = $(this).prev().text();
        console.log(data);
})

In my console I am getting the data that I want, but with two problems:

1.  Caffeine Content of Drinks All Coffee Soda Energy Drinks Tea Shots
    Loading data.../*<![CDATA[*/var totalrows=1127;
    var latestdate='06/12/2015';var tbldata=

I do not see this info on the page.

2.  I am getting my data back two times.

I put in a console.log for the data length. I got back 8 different lengths. I believe there is a workaround. However, I cannot figure this out.

Does anyone have any knowledge on the matter?


There are 2 answers

robertklep (best answer)

DataTables is a JavaScript library that dynamically creates, inserts, and modifies HTML elements in the DOM after the page has loaded. The table you want to scrape is created dynamically, but Cheerio only parses the static HTML it is given and does not run any scripts, so your scraper never sees those elements.
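One quick way to confirm this is to check whether the class name appears in the raw page source at all. The following is only a minimal sketch: the markup in staticSource is a made-up stand-in for the downloaded page (the table id is hypothetical), and hasClassInSource is a helper name introduced here for illustration.

```javascript
// Hypothetical static source as a scraper would receive it: DataTables has
// not run, so none of its generated wrapper divs exist yet.
const staticSource =
  '<div class="main"><table id="caffeinedb"></table>' +
  '<script>var tbldata=[];</script></div>';

// Returns true if any class attribute in the raw source contains className.
function hasClassInSource(source, className) {
  const classAttr = /class\s*=\s*(["'])(.*?)\1/gi;
  let m;
  while ((m = classAttr.exec(source)) !== null) {
    if (m[2].split(/\s+/).includes(className)) return true;
  }
  return false;
}

console.log(hasClassInSource(staticSource, 'dataTables_scrollBody')); // false
console.log(hasClassInSource(staticSource, 'main')); // true
```

If the class is missing from the raw source but visible in the browser's inspector, the element is being generated by a script after load.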

The data that is used to generate the table is stored as JavaScript in the page source, in a variable called tbldata (see this gist).

Two possible solutions:

  • use something like PhantomJS to load the page, which will also run any JavaScript on the page. After that, you can take the resulting DOM and parse it using Cheerio;
  • scrape the table data from the embedded JavaScript directly.
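
The second option could be sketched roughly as follows, using only Node's built-in vm module. This is an illustration under assumptions, not the exact approach used in the answer: the variable names and the totalrows/latestdate values come from the console output in the question, but the shape of tbldata (an array of row arrays) and the sample rows are invented, and extractTableData is a hypothetical helper name.

```javascript
const vm = require('vm');

// Hypothetical stand-in for the downloaded page source. The tbldata rows
// are invented; the real structure of tbldata is an assumption.
const html = `
<div class="main">
  <script type="text/javascript">/*<![CDATA[*/var totalrows=1127;
  var latestdate='06/12/2015';var tbldata=[["Drink A",8,80],["Drink B",12,34]];/*]]>*/</script>
</div>`;

// Finds the inline script that defines tbldata, evaluates it in a separate
// context, and returns the variables it declared (or null if not found).
function extractTableData(pageSource) {
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = scriptPattern.exec(pageSource)) !== null) {
    const code = match[1];
    if (!code.includes('var tbldata=')) continue;

    // Top-level `var` declarations become properties of the context object.
    const context = {};
    vm.runInNewContext(code, context, { timeout: 1000 });

    return {
      totalRows: context.totalrows,
      latestDate: context.latestdate,
      // Round-trip through JSON to get plain arrays from this context.
      rows: JSON.parse(JSON.stringify(context.tbldata)),
    };
  }
  return null;
}

const data = extractTableData(html);
console.log(data.totalRows, data.latestDate, data.rows.length);
```

Note that vm is not a security boundary, so this should only be run against page sources you trust. If the tbldata literal turns out to be valid JSON, cutting it out of the script text and passing it to JSON.parse would avoid evaluating the script at all.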

Winnemucca

robertklep was correct: I was attempting to scrape the table generated by DataTables. I found that, although Cheerio uses jQuery-style selectors, the data table was accessible once the page was loaded inside PhantomJS. I ended up working with a very basic library, node-phantom-simple. It works well with jQuery and has basic but straightforward examples.

I was able to require node-phantom-simple and then run my script with nodemon to do the scrape.

node-phantom-simple gives access to PhantomJS without requiring the user to call phantomjs on the command line.