import scraperwiki
import urllib2, lxml.etree
url = 'http://eci.nic.in/eci_main/statisticalreports/SE_1998/StatisticalReport-DEL98.pdf'
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)
# how many pages in PDF
pages = list(root)
print "There are",len(pages),"pages"
# from page 86 to 107
for page in pages[86:107]:
    for el in page:
        data = {}
        if el.tag == "text":
            if int(el.attrib['left']) < 215: data = { 'Rank': el.text }
            elif int(el.attrib['left']) < 230: data['Name'] = el.text
            elif int(el.attrib['left']) < 592: data['Sex'] = el.text
            elif int(el.attrib['left']) < 624: data['Party'] = el.text
            elif int(el.attrib['left']) < 750: data['Votes'] = el.text
            elif int(el.attrib['left']) < 801: data['Percentage'] = el.text
            print data
Now I am wondering how to save this data into ScraperWiki's database. I have tried a few commands like
scraperwiki.sqlite.save(unique_keys=[], table_name='ecidata1998', data=data)
but they don't give me the required result when I check the dataset. Is there something wrong with the code or with the last statement? Please help; I am new to Python programming and ScraperWiki.
There are a couple of problems with your code.
First, the conditions you've set to pull different content from the PDF need to be made more restricted and precise: e.g. if int(el.attrib['left']) < 215 will pull any text that has a left position of less than 215 pixels, which also matches other content in the PDF pages you're looking at (e.g. the text "Constituency").

Second, you need a way to check when you have all the data for a row and can move on to the next one. (You could try to pull the data out row by row, but I found it easier to grab the data from each field in turn and start a new row once I had all the data for the current one.)
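For example, a tighter check inside your inner loop could bound the position on both sides; the lower bound of 200 below is a placeholder, not a value measured from this PDF:

left = int(el.attrib['left'])
# A two-sided bound keeps header text such as "Constituency",
# which also sits to the left of 215px, out of the Rank column.
if 200 <= left < 215:
    data['Rank'] = el.text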
(As to why scraperwiki.sqlite.save wasn't working, it's probably because you had rows of empty values in there, but your data as you had it wasn't correct anyway.)

This works for me:
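A minimal sketch of that approach (Python 2 and the classic ScraperWiki library, as in your code): the URL, page slice, and table name are taken from your script, while the two-sided position bounds and the use of the Rank cell as a row delimiter are illustrative guesses that you would need to tune against the actual pdftoxml output.

import scraperwiki
import urllib2
import lxml.etree

url = 'http://eci.nic.in/eci_main/statisticalreports/SE_1998/StatisticalReport-DEL98.pdf'
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)

row = {}
for page in list(root)[86:107]:
    for el in page:
        if el.tag != "text" or el.text is None:
            continue
        left = int(el.attrib['left'])
        # Two-sided bounds so page furniture (headers, totals) is not
        # swept into the columns; the numbers are guesses to adjust
        # after inspecting the XML.
        if 200 <= left < 215:
            row = {'Rank': el.text}   # a Rank cell starts a new row
        elif 215 <= left < 230:
            row['Name'] = el.text
        elif 550 <= left < 592:
            row['Sex'] = el.text
        elif 592 <= left < 624:
            row['Party'] = el.text
        elif 700 <= left < 750:
            row['Votes'] = el.text
        elif 750 <= left < 801:
            row['Percentage'] = el.text
        # Only save once every one of the six fields has been seen.
        if len(row) == 6:
            scraperwiki.sqlite.save(unique_keys=[], table_name='ecidata1998', data=row)
            row = {}

Note that with unique_keys=[] every run appends rows, so re-running the scraper duplicates data; if the rows have a natural key (say, constituency plus rank), passing it as unique_keys makes re-runs overwrite instead.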