Python Scrapy slows down over time while parsing


I have a scraper bot that works fine, but its speed drops as it scrapes. I added concurrent requests and set DOWNLOAD_DELAY to 0 and AUTOTHROTTLE_ENABLED to False, but the result is the same: it starts at a fast pace and then gets slower. I suspect it has something to do with caching, but I don't know whether I have to clear a cache or why it behaves this way. The code is below; I would like to hear comments.

import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd
import scrapy_xlsx

itemList=[]
class plateScraper(scrapy.Spider):
    name = 'scrapePlate'
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']
    FEED_EXPORTERS = {'xlsx': 'scrapy_xlsx.XlsxItemExporter'}
    custom_settings = {
        'FEED_EXPORTERS': FEED_EXPORTERS,
        'FEED_FORMAT': 'xlsx',
        'FEED_URI': 'output_r00.xlsx',
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'CONCURRENT_ITEMS': 300,
        'CONCURRENT_REQUESTS': 30,
        'AUTOTHROTTLE_ENABLED': False,
    }

    def start_requests(self):
        df=pd.read_excel('data.xlsx')
        columnA_values=df['PLATE']
        for row in columnA_values:
            global  plate_num_xlsx
            plate_num_xlsx=row
            base_url =f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url=base_url
            yield scrapy.Request(url,callback=self.parse, cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

    def parse(self, response, plate_num_xlsx=None):
        plate = response.xpath('//div[@class="resultsstrip"]/a/text()').extract_first()
        price = response.xpath('//div[@class="resultsstrip"]/p/text()').extract_first()

        try:
            a = plate.replace(" ", "").strip()
            if plate_num_xlsx == plate.replace(" ", "").strip():
                item = {"plate": plate_num_xlsx, "price": price.strip()}
                itemList.append(item)
                print(item)
                yield item
            else:
                item = {"plate": plate_num_xlsx, "price": "-"}
                itemList.append(item)
                print(item)
                yield item
        except:
            item = {"plate": plate_num_xlsx, "price": "-"}
            itemList.append(item)
            print(item)
            yield item

process = CrawlerProcess()
process.crawl(plateScraper)
process.start()

import winsound
winsound.Beep(555,333)

EDIT: "log_stats"

{'downloader/request_bytes': 1791806,
 'downloader/request_count': 3459,
 'downloader/request_method_count/GET': 3459,
 'downloader/response_bytes': 38304184,
 'downloader/response_count': 3459,
 'downloader/response_status_count/200': 3459,
 'dupefilter/filtered': 6,
 'elapsed_time_seconds': 3056.810985,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 1, 27, 22, 31, 17, 17188),
 'httpcompression/response_bytes': 238767410,
 'httpcompression/response_count': 3459,
 'item_scraped_count': 3459,
 'log_count/INFO': 61,
 'log_count/WARNING': 2,
 'response_received_count': 3459,
 'scheduler/dequeued': 3459,
 'scheduler/dequeued/memory': 3459,
 'scheduler/enqueued': 3459,
 'scheduler/enqueued/memory': 3459,
 'start_time': datetime.datetime(2023, 1, 27, 21, 40, 20, 206203)}
2023-01-28 02:31:17 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

There are 4 answers

Georgiy (score: 8)

At first glance the code looks OK. However, I see several points that could help increase scraping speed here:

  1. CONCURRENT_REQUESTS_PER_DOMAIN setting - since it wasn't changed, it keeps its default value of 8 (no more than 8 concurrent requests per domain). Recommended: increase it up to the value of CONCURRENT_REQUESTS.
  2. CONCURRENT_ITEMS setting - there have been several reports that increasing this setting can lead to degraded performance (scrapy issue #5182). Recommended: keep it at its default.
  3. The custom scrapy_xlsx.XlsxItemExporter (assuming https://github.com/jesuslosada/scrapy-xlsx is used here) - at first glance I wouldn't expect issues, since ~3000 items is usually not a lot of data. However, an .xlsx file is technically a zipped archive of XML documents. The exporter uses openpyxl, which keeps the whole file contents and its parsed XML trees in RAM. Every added row grows the XML tree of the .xlsx file being created, so appending each new row becomes more CPU intensive over time. Recommended: compare scraping speed against Scrapy's built-in feed exporters (CSV or JSON lines); a settings sketch illustrating these changes follows below.
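
A minimal sketch of how points 1-3 might look in the spider's custom_settings (the values are illustrative, and the CSV feed is only meant for the speed comparison, not as a permanent replacement for the .xlsx export):

custom_settings = {
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 0,
    'AUTOTHROTTLE_ENABLED': False,
    'CONCURRENT_REQUESTS': 30,
    # Point 1: raise the per-domain cap (default 8) to match CONCURRENT_REQUESTS,
    # since this crawl only hits dvlaregistrations.dvla.gov.uk.
    'CONCURRENT_REQUESTS_PER_DOMAIN': 30,
    # Point 2: CONCURRENT_ITEMS is simply left out, so it keeps its default (100).
    # Point 3: temporarily export with a built-in exporter to compare speed.
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output_r00.csv',
}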
Md. Fazlul Hoque (score: 9)

One of the biggest reasons scraping slows down is setting CONCURRENT_ITEMS and CONCURRENT_REQUESTS to high values. Processing that many items and requests at the same time uses a lot of memory, slows down the PC/laptop, and consequently the scraping run takes longer to finish. You can decrease these values (e.g. to 10 and 5, or 20 and 10) to reduce the load on the system and speed up scraping. You can also keep DOWNLOAD_DELAY: 0 to make it faster if you aren't getting blocked; if you are, set the delay to a small value (e.g. 0.5) to slow down the requests. You can also set AUTOTHROTTLE_ENABLED: True, in which case Scrapy automatically adjusts the delay between requests based on the response time of the website (an AutoThrottle variant is sketched after the code below).

import scrapy
from scrapy.crawler import CrawlerProcess
import pandas as pd
import scrapy_xlsx

class PlateScraper(scrapy.Spider):
    name = 'scrape_plate'
    allowed_domains = ['dvlaregistrations.dvla.gov.uk']
    custom_settings = {
        'FEED_EXPORTERS': {'xlsx': 'scrapy_xlsx.XlsxItemExporter'},
        'FEED_FORMAT': 'xlsx',
        'FEED_URI': 'output_r00.xlsx',
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'CONCURRENT_ITEMS': 10,
        'CONCURRENT_REQUESTS': 5,
        'AUTOTHROTTLE_ENABLED': False
    }

    def start_requests(self):
        df = pd.read_excel('data.xlsx')
        column_a_values = df['PLATE']

        for plate_num in column_a_values:
            base_url = f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            yield scrapy.Request(base_url, callback=self.parse, cb_kwargs={'plate_num': plate_num})

    def parse(self, response, plate_num):
        plate = response.xpath('//div[@class="resultsstrip"]/a/text()').extract_first()
        price = response.xpath('//div[@class="resultsstrip"]/p/text()').extract_first()

        try:
            if plate_num == plate.replace(" ", "").strip():
                item = {"plate": plate_num, "price": price.strip()}
            else:
                item = {"plate": plate_num, "price": "-"}
        except AttributeError:
            # plate or price was None (no matching result on the page)
            item = {"plate": plate_num, "price": "-"}

        self.logger.info(item)
        yield item

if __name__ == "__main__":  
    process = CrawlerProcess()
    process.crawl(PlateScraper)
    process.start()
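
If you go the AutoThrottle route mentioned above instead, a minimal sketch of the relevant settings could look like this (the delay and concurrency values are illustrative assumptions, not measured recommendations):

custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 0.5,         # initial delay between requests
    'AUTOTHROTTLE_MAX_DELAY': 10,            # upper bound if the site responds slowly
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 8.0,  # average parallel requests to aim for
    'CONCURRENT_REQUESTS': 30,               # still acts as a hard upper limit
}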
Vojtěch Chvojka (score: 1)

Performance issues like this are sometimes really hard to crack just by reading the code; the problem could be in too many places. I'd suggest you run a profiler twice: 1) while the pace is still quick, and 2) for an extended period of time, long enough for the pace to slow down. The slower the pace gets, the better.

Then compare the two results. Some operations will probably show a percentage increase. Maybe you'll see an increase in .xlsx compression time as suggested before, maybe a long operation searching through a cache, or maybe something completely different. That increase will point you to the right place to look for the bug.

Measuring your code could look something like this (but be sure to refer to the cProfile documentation):

...
import cProfile
...

pr = cProfile.Profile()
pr.enable()

try:
    process = CrawlerProcess()
    process.crawl(plateScraper)
    process.start()
    # Press CTRL+c any time you want to finish profiling.
except KeyboardInterrupt:
    pass

pr.disable()
# Save/print profiler results here.
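
Continuing the snippet above, the "save/print" step could use the standard pstats module to show where the time went and to persist the data so the fast and slow runs can be compared (the filename is just an example):

import pstats

stats = pstats.Stats(pr)
stats.sort_stats("cumulative").print_stats(30)  # 30 most time-consuming calls
stats.dump_stats("profile_run.prof")            # raw data for later comparison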
Kushagra A. Nalwaya (score: 2)

A straightforward way to deal with a scraper that slows down over time is to combine Selenium with strategic pauses and server changes. Let's go through the process in detail:

1. Pausing the Scraper When it Slows Down:

When your scraper encounters a slowdown, you can pause it using WebDriverWait(driver, time) inside a try-except block. Set the timeout to an appropriate value for your specific program. The idea is that when the program slows down, the wait condition times out and the except block is triggered. In that except block you can pause the scraper by requesting user input, typically by pressing Enter, for example: input("Press enter to continue: "). During this pause you can go and change the server (a sketch of this pattern follows below).
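
A minimal sketch of this pause-and-resume pattern, assuming a Selenium-driven version of the same crawl (the browser choice, the 10-second timeout, and the resultsstrip locator are assumptions for illustration, not taken from the question's Scrapy code):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()

def fetch(url):
    driver.get(url)
    try:
        # If the connection is being throttled, the results element will not
        # appear in time and TimeoutException sends us to the except branch.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "resultsstrip"))
        )
    except TimeoutException:
        # Pause here, switch the VPN server (step 2), then press Enter (step 3).
        input("Press enter to continue: ")
        driver.get(url)  # retry the same page over the new connection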

2. Changing Servers:

Before changing servers, it's important to understand why this step is crucial. When numerous requests are sent from a single connection, certain websites may either block the IP address or throttle the connection speed to prevent potential Denial-of-Service (DoS) attacks. Hence, it's advisable to switch servers and begin afresh. To achieve this, you can employ various methods. A straightforward approach is to utilize a VPN service. While free options like Psiphon are available, they tend to have limited longevity and may slow down quickly. As a starting point, you can experiment with these free options. If they don't resolve the issue satisfactorily, consider upgrading to paid VPN services like Nord or ExpressVPN. These paid services offer a multitude of servers to choose from, allowing you to effortlessly switch servers whenever your scraper encounters slowdowns.

3. Resuming the Scraper:

Once the VPN connects, press Enter to signal the scraper to continue from where it left off in step 1. This speeds the scraper back up, since the requests now go through a new connection.

If it slows down again, simply repeat the steps outlined from 1 to 3.

By implementing this strategy, you can significantly enhance the resilience and efficiency of your web scraping endeavors.