Scraping request run through Heroku Scheduler (Node.js) fails half the time

818 views Asked by At

I set up a very basic web scraper to check stock of a specific item on Costco.com for my grandfather. It's working great locally, but when I run it through Heroku it fails (seemingly 50% of the time). Here's the code for the scraper

const task = () => {

  // toggle so doesn't send message multiple times if continuously available
  let alreadyAvailable = false;

  let url = 'http://www.costco.com/Kirkland-Signature-Four-Piece-Urethane-Cover-Golf-Ball,-2-dozen.product.100310467.html';
  request(url, function(error, response, html){

    let $ = cheerio.load(html);

    if(error){
      throw new Error(error);
    }

    if ( $('#product-page #product-details #ctas #add-to-cart input[type="button"]')['0'].attribs.value === 'Out of Stock') {
      alreadyAvailable = false;
      console.log("still out of stock");
    } else {
      if (alreadyAvailable === false) {
        sendMessage();
        alreadyAvailable = true;
      }
    }

  });
};

and here are the logs

2016-12-25T03:48:39.675549+00:00 heroku[scheduler.5440]: Starting process with command `node scraper.js`
2016-12-25T03:48:40.262503+00:00 heroku[scheduler.5440]: State changed from starting to up
2016-12-25T03:48:41.509416+00:00 app[scheduler.5440]: /app/scraper.js:34
2016-12-25T03:48:41.509432+00:00 app[scheduler.5440]:     if ( $('#product-page #product-details #ctas #add-to-cart input[type="button"]')['0'].attribs.value === 'Out of Stock') {
2016-12-25T03:48:41.509433+00:00 app[scheduler.5440]:                                                                                          ^
2016-12-25T03:48:41.509433+00:00 app[scheduler.5440]:
2016-12-25T03:48:41.509434+00:00 app[scheduler.5440]: TypeError: Cannot read property 'attribs' of undefined
2016-12-25T03:48:41.509434+00:00 app[scheduler.5440]:     at Request._callback (/app/scraper.js:34:90)
2016-12-25T03:48:41.509435+00:00 app[scheduler.5440]:     at Request.self.callback (/app/node_modules/request/request.js:186:22)
2016-12-25T03:48:41.509436+00:00 app[scheduler.5440]:     at emitTwo (events.js:106:13)
2016-12-25T03:48:41.509436+00:00 app[scheduler.5440]:     at Request.emit (events.js:191:7)
2016-12-25T03:48:41.509436+00:00 app[scheduler.5440]:     at Request.<anonymous> (/app/node_modules/request/request.js:1081:10)
2016-12-25T03:48:41.509437+00:00 app[scheduler.5440]:     at emitOne (events.js:96:13)
2016-12-25T03:48:41.509437+00:00 app[scheduler.5440]:     at Request.emit (events.js:188:7)
2016-12-25T03:48:41.509438+00:00 app[scheduler.5440]:     at IncomingMessage.<anonymous> (/app/node_modules/request/request.js:1001:12)
2016-12-25T03:48:41.509438+00:00 app[scheduler.5440]:     at IncomingMessage.g (events.js:291:16)
2016-12-25T03:48:41.509439+00:00 app[scheduler.5440]:     at emitNone (events.js:91:20)
2016-12-25T03:48:41.509439+00:00 app[scheduler.5440]:     at IncomingMessage.emit (events.js:185:7)
2016-12-25T03:48:41.509439+00:00 app[scheduler.5440]:     at endReadableNT (_stream_readable.js:974:12)
2016-12-25T03:48:41.509440+00:00 app[scheduler.5440]:     at _combinedTickCallback (internal/process/next_tick.js:74:11)
2016-12-25T03:48:41.509440+00:00 app[scheduler.5440]:     at process._tickCallback (internal/process/next_tick.js:98:9)
2016-12-25T03:48:41.560539+00:00 heroku[scheduler.5440]: State changed from up to complete
2016-12-25T03:48:41.550655+00:00 heroku[scheduler.5440]: Process exited with status 1
2016-12-25T03:58:42.438807+00:00 app[api]: Starting process with command `node scraper.js` by user [email protected]
2016-12-25T03:58:43.701468+00:00 heroku[scheduler.5038]: Starting process with command `node scraper.js`
2016-12-25T03:58:44.312279+00:00 heroku[scheduler.5038]: State changed from starting to up
2016-12-25T03:58:45.769564+00:00 app[scheduler.5038]: still out of stock
2016-12-25T03:58:45.827867+00:00 heroku[scheduler.5038]: State changed from up to complete
2016-12-25T03:58:45.814921+00:00 heroku[scheduler.5038]: Process exited with status 0

You can see that sometimes I get the console log inside of the if-block and others I get a type error because it's trying to read attributes from an html element that don't exist. I was thinking that this might be a async issue, but I'm not sure how to go about fixing it. I assumed Request wasn't running the callback until it got all the html.

1

There are 1 answers

1
rdegges On

The problem here is what Costco's website is returning.

Your code is failing when parsing the DOM via Cheerio. What this means in your case is that the particular HTML you're attempting to scrape doesn't actually exist (that's what the error is saying).

This could be caused by a few possible things:

  • Costco is rendering a DIFFERENT page than what you expect (maybe it thinks you're a bot, or is doing some throttling).
  • You are receiving a redirect or some other sort of non-error HTTP status code, and the HTML you're looking for doesn't exist there.
  • Costco's website changes the HTML dynamically to prevent people from scraping.

What I would do if I were you is this:

  • Have your process log all the page's HTML when the task runs.
  • The next time your process fails, copy the HTML from the Heroku logs into a local editor, and see what it returned.

I'm willing to bet you will be surprised =)