I am studying web scraping for big data, so I wrote the following code to take some information from a local server on our campus. It works, but performance is very slow: each record takes 0.91 s to be stored in the database. The code opens a web page, extracts some content, and stores it on disk.
My goal is to lower the time needed to scrape one record to something near 0.4 s (or less, if possible).
#!/usr/bin/env python
import scraperwiki
import requests
import lxml.html
for i in range(1, 150):
    try:
        # fetch the page for record i
        html = requests.get("http://testserver.dc/" + str(i) + "/").content
        dom = lxml.html.fromstring(html)
        for entry in dom.cssselect('.rTopHeader'):
            name = entry.cssselect('.bold')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            city = entry.cssselect('li:nth-child(2) span')[0].text_content()
        for entry in dom.cssselect('div#rProfile'):
            profile_id = entry.cssselect('li:nth-child(3) strong a')[0].get('href')
        profile = {
            'name': name,
            'city': city,
            'profile_id': profile_id
        }
        # store the record, keyed on profile_id
        unique_keys = ['profile_id']
        scraperwiki.sql.save(unique_keys, profile)
        print(profile_id)
    except Exception:
        print('Error: ' + str(i))
That is a good start: you have a clear target for how far you want to optimize.
Measure the time spent on scraping first
The limiting factor is most likely fetching the URLs.
Simplify your code and measure how long the fetching alone takes. If that already fails your timing criterion (for example, if a single request takes 0.5 seconds), you will have to scrape in parallel. Search Stack Overflow; there are many questions and answers on this, using threading, green threads, etc.
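As a concrete illustration, here is a minimal Python 3 sketch that times the run and fetches the pages with a thread pool (`concurrent.futures`). The URL pattern and the 1..149 range come from your code; the pool size of 10 and the 10-second timeout are arbitrary starting points, and the parsing/saving step is only indicated by a comment.

#!/usr/bin/env python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(i):
    # one network round-trip per record; this is usually where the 0.9 s goes
    return i, requests.get("http://testserver.dc/" + str(i) + "/", timeout=10).content

start = time.time()
pages = {}
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, i) for i in range(1, 150)]
    for future in as_completed(futures):
        i, html = future.result()
        pages[i] = html  # parse `html` and save the record here, as in your code
elapsed = time.time() - start
print('%.1f s total, %.3f s per page' % (elapsed, elapsed / len(pages)))

With 10 workers the requests overlap, so the wall-clock time per page should drop well below the time of a single request, as long as the server tolerates the extra concurrency.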
Further optimization tips
Parsing XML - use an iterative / SAX approach
Your DOM creation can be turned into iterative parsing. It needs less memory and is very often much faster.
lxml offers methods like iterparse; also see the related SO answer.
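Here is a minimal sketch of that idea, not a drop-in replacement: it assumes the markup implied by your selectors (an element with class "bold" inside .rTopHeader, and li items inside div#rProfile) and relies on lxml.etree.iterparse with html=True. For small pages the gain over fromstring() is probably modest compared to the network time, but the pattern scales to much larger documents.

import io

import lxml.etree

def parse_profile(html_bytes):
    """Pull name, city and profile_id from one page without keeping the full DOM."""
    profile = {}
    # "end" events fire once an element and all its children are parsed;
    # html=True switches lxml to its forgiving HTML parser.
    for _event, elem in lxml.etree.iterparse(io.BytesIO(html_bytes),
                                             events=("end",), html=True):
        if "rTopHeader" in (elem.get("class") or ""):
            bold = elem.find(".//*[@class='bold']")
            if bold is not None:
                profile["name"] = "".join(bold.itertext()).strip()
            elem.clear()  # drop the processed subtree to keep memory low
        elif elem.get("id") == "rProfile":
            items = elem.findall(".//li")
            if len(items) >= 3:
                span = items[1].find(".//span")
                link = items[2].find(".//strong/a")
                if span is not None:
                    profile["city"] = "".join(span.itertext()).strip()
                if link is not None:
                    profile["profile_id"] = link.get("href")
            elem.clear()
    return profile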
Optimize writing into the database
Writing many records one by one can be turned into writing them in batches.
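A minimal sketch of batching, assuming (as the classic ScraperWiki API did) that save() accepts a list of dicts; if your version does not, wrapping the loop in a single explicit transaction achieves a similar effect. The helper name and the batch size of 50 are my own choices for illustration.

import scraperwiki

def save_profiles(profiles, batch_size=50):
    """Write profile dicts in batches instead of one save() call per record."""
    for start in range(0, len(profiles), batch_size):
        scraperwiki.sql.save(['profile_id'], profiles[start:start + batch_size])

In your loop you would append each profile dict to a list and call save_profiles(profiles) at the end (or whenever the list reaches the batch size), so the database sees a handful of bulk writes instead of 149 single-row inserts.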