Link sniffing - http get response codes wrong

131 views Asked by At

I have the task to go trough around 200k links and check the status code of their responses. Anything other than 2xx would mean a problem which means that link has to be manually checked (added to a DB later).

The links I have from a DB and are both http and https, some of them are not valid (e.g. ww.fyxs.d). The format I get is JSON and it's something like this

{
    "id": xxx,
    "url": xxxx
}

I went with a really simple solution which unfortunately doesn't work.

I am taking the links from a json file and then starting from the back sending a http/https.get request, waiting for the response, checking and processing the status code and moving to the next link after removing the previous one from the list to preserve memory. The problem is that I keep getting 4xx almost all the time and if I do a GET from a REST client I get a 200 OK.

I don't know if it's possible but I only need the correct status code and the body I'm not interested in hence the HEAD method. I also tried with method: 'GET' - still wrong status codes and http/https.request - I don't even get a response.

Here is my code:

var https = require('https');
var http = require('http');
var urlMod = require('url');

var links = require('./links_to_check.json').links_to_check;

var callsRemaining = links.length;
var current = links.length - 1;

startQueue();

function startQueue(){
 getCode(links[current].url);
 current--;
}

function getCode(url){
 var urlObj = urlMod.parse(url);
 var options = {
  method: 'HEAD',
  hostName: urlObj.host,
  path: urlObj.path
 };
 var httpsIndex = url.indexOf('https');
 if(httpsIndex > -1 && httpsIndex < 5){
  https.get(options,function(response){
   proccessResponse(response.statusCode);
  }).on('error', (e) => {
   startQueue();
  });
 }else{
  if(url.indexOf('http:') < 0) return;
  http.get(options,function(response){
   proccessResponse(response.statusCode);
  }).on('error', (e) => {
   startQueue();
  });
 }
}

function proccessResponse(responseCode){ 
 console.log("response => " + responseCode);
 if(responseCode != 200){
  errorCount++;
 }
 ResponseReady();
}

function ResponseReady(){
 --callsRemaining;
 if(callsRemaining <= 0){
  //Proccess error when done
 } 
 links.pop();
 startQueue();
}

I would really appreciate some help - when we succeed I will publish it as a module so if someone needs to check a set of links they can just use it :)

After we solve this I was thinking of using async.map and splitting the links to chunks and running the analysis in parallel so it's faster. The current process written in shell takes around 36 hours.

0

There are 0 answers