I am studying web scraping for big data, so I wrote the following code to take some information from a local server on our campus. It works, but performance is very slow: each record takes 0.91 s to be stored in the database. The code opens a web page, extracts some content, and stores it on disk.
My goal is to lower the time needed to scrape one record to something near 0.4 s (or less, if possible).
#!/usr/bin/env python
import scraperwiki
import requests
import lxml.html
for i in range(1, 150):
    try:
        # fetch the page for record i
        html = requests.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)
        for entry in dom.cssselect('.rTopHeader'):
            name = entry.cssselect('.bold')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            city = entry.cssselect('li:nth-child(2) span')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            profile_id = entry.cssselect('li:nth-child(3) strong a')[0].get('href')
        profile = {
            'name': name,
            'city': city,
            'profile_id': profile_id
        }
        # store the record, keyed on profile_id
        unique_keys = ['profile_id']
        scraperwiki.sql.save(unique_keys, profile)
        print(profile_id)
    except Exception:
        print('Error: ' + str(i))
That is a good start: you have a clear target for how far you want to optimize.
Measure the time spent on scraping first
The limiting factor is most likely fetching the URLs.
Simplify your code and measure how long the fetching alone takes. If that already fails your timing criterion (for example, if a single request takes 0.5 seconds), you will have to scrape in parallel. Search Stack Overflow; there are many questions and answers on this, using threading, green threads, etc.
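As a concrete illustration, here is a minimal Python 3 sketch that times the run and fetches the pages with a thread pool (`concurrent.futures`). The URL pattern and the 1..149 range come from your code; the pool size of 10 and the 10-second timeout are arbitrary starting points, and the parsing/saving step is only indicated by a comment.

#!/usr/bin/env python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(i):
    # one network round-trip per record; this is usually where the 0.9 s goes
    return i, requests.get("http://testserver.dc/" + str(i) + "/", timeout=10).content

start = time.time()
pages = {}
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, i) for i in range(1, 150)]
    for future in as_completed(futures):
        i, html = future.result()
        pages[i] = html  # parse `html` and save the record here, as in your code
elapsed = time.time() - start
print('%.1f s total, %.3f s per page' % (elapsed, elapsed / len(pages)))

With 10 workers the requests overlap, so the wall-clock time per page should drop well below the time of a single request, as long as the server tolerates the extra concurrency.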
Further optimization tips
Parsing XML - use an iterative / SAX approach
Your DOM creation can be turned into iterative parsing. It needs less memory and is very often much faster.
lxml offers methods like iterparse; also see the related SO answer.
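Here is a minimal sketch of that idea, not a drop-in replacement: it assumes the markup implied by your selectors (an element with class "bold" inside .rTopHeader, and li items inside div#rProfile) and relies on lxml.etree.iterparse with html=True. For small pages the gain over fromstring() is probably modest compared to the network time, but the pattern scales to much larger documents.

import io

import lxml.etree

def parse_profile(html_bytes):
    """Pull name, city and profile_id from one page without keeping the full DOM."""
    profile = {}
    # "end" events fire once an element and all its children are parsed;
    # html=True switches lxml to its forgiving HTML parser.
    for _event, elem in lxml.etree.iterparse(io.BytesIO(html_bytes),
                                             events=("end",), html=True):
        if "rTopHeader" in (elem.get("class") or ""):
            bold = elem.find(".//*[@class='bold']")
            if bold is not None:
                profile["name"] = "".join(bold.itertext()).strip()
            elem.clear()  # drop the processed subtree to keep memory low
        elif elem.get("id") == "rProfile":
            items = elem.findall(".//li")
            if len(items) >= 3:
                span = items[1].find(".//span")
                link = items[2].find(".//strong/a")
                if span is not None:
                    profile["city"] = "".join(span.itertext()).strip()
                if link is not None:
                    profile["profile_id"] = link.get("href")
            elem.clear()
    return profile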
Optimize writing into the database
Writing many records one by one can be turned into writing them in batches.
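A minimal sketch of batching, assuming (as the classic ScraperWiki API did) that save() accepts a list of dicts; if your version does not, wrapping the loop in a single explicit transaction achieves a similar effect. The helper name and the batch size of 50 are my own choices for illustration.

import scraperwiki

def save_profiles(profiles, batch_size=50):
    """Write profile dicts in batches instead of one save() call per record."""
    for start in range(0, len(profiles), batch_size):
        scraperwiki.sql.save(['profile_id'], profiles[start:start + batch_size])

In your loop you would append each profile dict to a list and call save_profiles(profiles) at the end (or whenever the list reaches the batch size), so the database sees a handful of bulk writes instead of 149 single-row inserts.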