HTMLAgilityPack returning HTTP Response Code of "not found" for a page which exists


I'm writing my own web crawler to find bad links on my website and to build a sitemap on the fly every night. I pass in a starting URL, pull the content down, and then use HtmlAgilityPack to scrape the page for anything that links to another URL, image, CSS file, or JavaScript file. I build up a list of URLs to check and record the status of each.

It's working great, except that a handful of URLs linking to external sites come back as "404 Not Found", yet when I open the URL that HtmlAgilityPack attempted to load in a browser, the page exists. Not all external links have this issue; most come back as OK. My code which pulls in the document is:

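// Requires: using System.Net; using System.Threading.Tasks; using HtmlAgilityPack;
var page = new HtmlWeb();
var tcs = new TaskCompletionSource<HttpWebResponse>();

// Capture the raw response so its status code can be recorded afterwards.
page.PostResponse = delegate (HttpWebRequest request, HttpWebResponse response)
{
    tcs.SetResult(response);
};

// pageAddress is the URL currently being checked.
var doc = page.Load(pageAddress);
var httpWebResponse = tcs.Task.Result;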

I suspect it might have to do with something in the HttpWebRequest that HtmlAgilityPack is using to call the page, but I'm not exactly sure. Any ideas?

Edit: here's an example: https://cdn.datatables.net/1.13.2/css/jquery.datatables.min.css


There is 1 answer

LarryBud On

The issue ended up being my own fault: my application was forcing the URLs to lowercase, and apparently this CDN is case-sensitive.

If you go to

https://cdn.datatables.net/1.13.2/css/jquery.dataTables.min.css

the URL works, but

https://cdn.datatables.net/1.13.2/css/jquery.datatables.min.css

is a 404.
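
For anyone who still wants to normalize URLs (for example, to avoid checking the same link twice), here is a minimal sketch of one way to do it without touching the path. It relies on System.Uri, which lowercases the scheme and host when it canonicalizes a URL but leaves the path and query as written. The UrlNormalizer class and Normalize method are hypothetical names for illustration, not part of my crawler or of HtmlAgilityPack.

```csharp
using System;

public static class UrlNormalizer
{
    // Hypothetical helper: returns a canonical form of an absolute URL.
    // System.Uri lowercases the scheme and host (both case-insensitive)
    // but preserves the case of the path and query, which a server such
    // as this CDN may treat as case-sensitive.
    public static string Normalize(string url)
    {
        var uri = new Uri(url, UriKind.Absolute);
        return uri.AbsoluteUri;
    }
}
```

Calling UrlNormalizer.Normalize on a link before adding it to the list of URLs to check should keep "jquery.dataTables.min.css" intact, whereas calling ToLower() on the whole string is what produced the 404s above.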