I'm currently running a Python script against multiple web servers. The general task is to find broken (external) links within a CMS. The script runs pretty well so far, but I test around 50 internal projects, each with several hundred subpages. That adds up to several thousand external links I have to check.
For that reason I added multi-threading, which improved performance as I had hoped. But here comes the problem: if a page contains a list of links to the same server (a bundle of known issues or to-do tasks), checking it slows down the destination system. I don't want to slow down my own servers, nor servers that are not mine.
Currently I run up to 20 threads and then wait 0.5s until a "thread position" is free. To check whether a URL is broken I use urlopen(request) from urllib2 and log every time it throws an HTTPError. Back to the list of multiple URLs on the same server: because of the multi-threading, my script will "flood" the web server with up to 20 simultaneous requests.
Just so you have an idea of the scale this script runs at and how many URLs have to be checked: using only 20 threads "slows" the current script down to a 45 min running time for only 4 projects. And this is only checking .. Next step will be to check broken URLs for . With the current script, server monitoring shows some peaks with 1000 ms response time.
Does anyone have an idea how to improve this script in general? Or is there a much better way to check this large number of URLs? Maybe a counter that pauses a thread if there are 10 requests to a single destination?
Thanks for all suggestions
When I was running a crawler, I had all of my URLs prioritized by domain name. Basically, my queue of URLs to crawl was really a queue of domain names, and each domain name had a list of URLs.
When it came time to get the next URL to crawl, a thread would pull a domain name from the queue and crawl the next URL on that domain's list. When done processing that URL, the thread would put the domain on a delay list, remove from the delay list any domains whose delay had expired, and return those domains to the queue.
The delay list was a priority queue ordered by expiration time. That way I could give different delay times to each domain. That allowed me to support the crawl-delay extension to robots.txt. Some domains were ok with me hitting their server once per second. Others wanted a one minute delay between requests.
With this setup, I never hit the same domain with multiple threads concurrently, and I never hit them more often than they requested. My default delay was something like 5 seconds. That seems like a lot, but my crawler was looking at millions of domains, so it was never wanting for stuff to crawl. You could probably reduce your default delay.
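A minimal sketch of the scheme described above, assuming Python 3. The class `DomainScheduler` and its method names are hypothetical, chosen for illustration, and robots.txt parsing is left out (`set_delay` simply takes a number of seconds):

```python
import heapq
import threading
import time
from collections import deque
from urllib.parse import urlsplit


class DomainScheduler:
    """Hands out at most one URL per domain at a time, spaced by a per-domain delay."""

    def __init__(self, default_delay=5.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._urls = {}        # domain -> deque of URLs still to check
        self._ready = deque()  # domains that may be crawled right now
        self._delayed = []     # heap of (expiry_time, domain): the delay list
        self._delays = {}      # per-domain overrides, e.g. from Crawl-delay
        self._default_delay = default_delay
        self._clock = clock

    def set_delay(self, domain, seconds):
        with self._lock:
            self._delays[domain] = seconds

    def add(self, url):
        domain = urlsplit(url).netloc
        with self._lock:
            if domain not in self._urls:
                self._urls[domain] = deque()
                self._ready.append(domain)
            self._urls[domain].append(url)

    def _release_expired(self):
        # Move every domain whose delay has run out back to the ready queue.
        now = self._clock()
        while self._delayed and self._delayed[0][0] <= now:
            _, domain = heapq.heappop(self._delayed)
            self._ready.append(domain)

    def next_url(self):
        """Return (domain, url), or None if no domain is ready right now."""
        with self._lock:
            self._release_expired()
            if not self._ready:
                return None
            domain = self._ready.popleft()
            return domain, self._urls[domain].popleft()

    def done(self, domain):
        """Call after the request finishes; parks the domain on the delay list."""
        with self._lock:
            if self._urls[domain]:
                delay = self._delays.get(domain, self._default_delay)
                heapq.heappush(self._delayed, (self._clock() + delay, domain))
            else:
                # Simplification: a domain with no URLs left is forgotten,
                # so its delay is not enforced if URLs for it arrive later.
                del self._urls[domain]
            self._release_expired()
```

A worker thread would loop on `next_url()`, sleep briefly when it returns `None`, and call `done(domain)` after each request. Because a domain is off the ready queue while it is in flight or delayed, no two threads ever hit the same domain at once.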
If you don't want to queue your URLs by domain name, what you can do is maintain a lookup structure (a hash table; in Python, a set or dict) that holds the domain names that are currently being crawled. When you dequeue a URL, you check the domain against the hash table, and put the URL back into the queue if the domain is currently in use. Something like:
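A minimal sketch of that check, assuming Python 3 (`urllib.parse` rather than the `urllib2`-era modules). The names `url_queue`, `in_use`, `claim_next_url`, and `release` are hypothetical:

```python
import queue
import threading
from urllib.parse import urlsplit

url_queue = queue.Queue()        # unbounded queue of URLs to check
in_use = set()                   # domains currently being crawled
in_use_lock = threading.Lock()   # guards in_use


def claim_next_url():
    """Dequeue a URL whose domain is idle; return (domain, url) or None."""
    try:
        url = url_queue.get_nowait()
    except queue.Empty:
        return None
    domain = urlsplit(url).netloc
    with in_use_lock:
        if domain in in_use:
            url_queue.put(url)   # domain busy: push the URL back, retry later
            return None
        in_use.add(domain)
    return domain, url


def release(domain):
    """Mark the domain as idle again once the request has finished."""
    with in_use_lock:
        in_use.discard(domain)
```

A worker would call `claim_next_url()`, check the URL, and then call `release(domain)`.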
That will work, although it's going to be a big CPU pig if the queue contains a lot of URLs from the same domain. For example, if you have 20 threads and only 5 different domains represented in the queue, then at most 5 threads can be crawling at once, so at least 15 of your threads will be continually spinning, looking for a URL to crawl.