Scraping Text from table using Soup / Xpath / Python


I need help extracting data from: http://agmart.in/crop.aspx?ccid=1&crpid=1&sortby=QtyHigh-Low

With the filter applied, there are about 4 pages of tabular data (under rice crops) that I need to store.

I'm not quite sure how to proceed with it. I've been reading all the documentation I can find, but as someone who has just started with Python, I'm very confused at the moment. Any help is appreciated.

Here's a code snippet I'm basing it on:

Example website : http://www.uscho.com/rankings/d-i-mens-poll/

from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print

I can't seem to understand any of the code above. Only understood that the URL is being read. :(

Thank you for any help!

There is 1 answer

Aditya On BEST ANSWER

Just as CSS has selectors like .window or #rankings, XPath is a language for navigating through the elements and attributes of an XML (or HTML) document.

In the for loop, you first search for every element called section, on the condition that it has an id attribute whose value is rankings. The page has only one such element, so the loop runs just once. That section contains the heading "Final USCHO.com Division I Men's Poll", the date, and the table. Inside the loop, you extract the text (everything between the tags) of the first h1 (the heading) and the first h3 (the date).
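Here is a minimal Python 3 sketch of that first step, run against a small inline HTML fragment instead of the live page. The fragment, the extra section and the "Sample date" text are made up for illustration. Only the section id and the h1/h3 layout are taken from the question's code.

```python
from lxml import etree

# Hypothetical stand-in for the real page; the date is a placeholder.
html = """
<html><body>
  <section id="other"><h1>Ignored</h1></section>
  <section id="rankings">
    <h1>Final USCHO.com Division I Men's Poll</h1>
    <h3>Sample date</h3>
  </section>
</body></html>
"""

tree = etree.HTML(html)

# '//section' searches the whole document; [@id="rankings"] keeps only
# the sections whose id attribute equals "rankings".
sections = tree.xpath('//section[@id="rankings"]')

for section in sections:
    # These paths are relative to `section`. [1] picks the first h1/h3
    # child, and text() returns its text as a list of strings.
    heading = section.xpath('h1[1]/text()')[0]
    date = section.xpath('h3[1]/text()')[0]
    print(heading, date)
```

The section with id "other" is never visited, which is the whole point of the [@id="rankings"] condition.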

The next part selects the rows (tr) of the table inside that section, with a condition on each row's class: it must be either even or odd. If every row you need carries one of those two classes, the condition adds nothing, and since you want all the rows in this table you can drop it.

You could replace the line

for row in section.xpath('table/tr[@class="even" or @class="odd"]'):

with

for row in section.xpath('table/tr'):
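To see what the class condition does and does not change, here is a small Python 3 sketch on a made-up table. The header row without a class and the "Sample Team" entry are assumptions for illustration; I have not checked whether the real page has such a row.

```python
from lxml import etree

# Hypothetical table: one header row without a class, two data rows.
html = """
<section id="rankings">
  <table>
    <tr><th>Rank</th><th>Team</th></tr>
    <tr class="odd"><td>1</td><td>Providence</td></tr>
    <tr class="even"><td>2</td><td>Sample Team</td></tr>
  </table>
</section>
"""

section = etree.HTML(html).xpath('//section[@id="rankings"]')[0]

# Only rows whose class attribute is exactly "even" or "odd".
filtered = section.xpath('table/tr[@class="even" or @class="odd"]')

# Every row that is a direct child of the table.
unfiltered = section.xpath('table/tr')

print(len(filtered), len(unfiltered))
```

If the table has no rows outside those two classes, both expressions return the same rows. If it does have a header row like the one assumed here, the shorter expression also returns that row, and its list of td cells will simply be empty.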

Inside this loop, row.xpath('td') returns every td element of the current row, that is, each cell in that row. That's why the last line says row.xpath('td'). When you iterate over the result, you get one cell element for each value, e.g. 1, Providence, 49, 26-13-2, 997, 15. Compare this with the first line of the table on the webpage.

Try this for yourself. Replace the last loop block with this much easier-to-read alternative:

for row in section.xpath('table/tr'):
    print row.xpath('td//text()')

You will see that it prints each row's data as a Python list, with one item per cell. Your code is just a fancier way of writing the same items: it joins the text of each cell and formats the cells into one string, padded with spaces into fixed-width columns. When the expression selects elements, the xpath() method returns a list of Element objects, each representing one XML/HTML element. An expression ending in text(), such as xpath('something//text()'), returns the actual text inside those tags as strings.

Here are a few helpful references:
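The difference between the two kinds of result can be sketched in Python 3 on a single row. The values are the ones quoted above, but the surrounding markup is assumed, and print is written as a function because this sketch targets Python 3 rather than the Python 2 of the question.

```python
from lxml import etree

# One row built from the values quoted above; the markup is assumed.
html = """
<table>
  <tr class="odd">
    <td>1</td><td>Providence</td><td>49</td>
    <td>26-13-2</td><td>997</td><td>15</td>
  </tr>
</table>
"""

row = etree.HTML(html).xpath('//tr')[0]

cells = row.xpath('td')          # list of Element objects, one per cell
texts = row.xpath('td//text()')  # list of strings, one per text node

# Same formatting as the snippet in the question: join the text inside
# each cell, then pad the six cells into fixed-width columns.
line = '%-3s %-20s %10s %10s %10s %10s' % tuple(
    ''.join(col.xpath('.//text()')) for col in cells)

print(texts)
print(line)
```

Note that the format string expects exactly six values, so it only works for rows with six cells. Printing the plain list has no such restriction.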

Easy to understand tutorial : http://www.w3schools.com/xpath/xpath_examples.asp

Stack Overflow question : Extract text between tags with XPath including markup

Another tutorial : http://www.tutorialspoint.com/xpath/