Python 3.4 : LXML web scraping

Question

460 views Asked by Aran Freel At 09 June 2015 at 15:13

I am using the following code to try to return a list of tickers from the Wikipedia page below. The result is an empty list. I copied the XPath from the Chromium developer tools. What am I doing wrong?

from lxml import html
import requests


url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

resp = requests.get(url)
tree = html.fromstring(resp.text)

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

print(tickers)

Original Q&A

There is 1 answer

**Martijn Pieters** · Accepted Answer · 2015-06-09T15:23:00+00:00

Browsers add missing HTML elements that the HTML specification states are part of the document model. lxml does not add them.

The most common such element is the <tbody> element. Your document has no such element, but Chrome adds one and includes it in the XPath it generates. Another such element is the <thead> element; again, the original HTML lacks it, but Chrome adds it and places the one <tr> row with <th> elements inside it.

As such the correct XPath expression is:

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

i.e. the second row in the table (the first row holds the <th> headers), first table cell in that row.
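
The difference can be reproduced offline. The following is a minimal sketch that assumes a small hand-written stand-in for the Wikipedia table (the SAMPLE markup is illustrative, not the real page source); it shows that the XPath copied from the browser matches nothing in lxml's tree, while the expression without <tbody> finds the link:

```python
from lxml import html

# Hypothetical stand-in for the page: a table written without <tbody>,
# as in the raw HTML that lxml receives.
SAMPLE = """<html><body><div id="mw-content-text">
<table>
<tr><th>Ticker symbol</th></tr>
<tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td></tr>
</table>
</div></body></html>"""

tree = html.fromstring(SAMPLE)

# XPath as copied from the browser's developer tools (expects <tbody>).
browser_style = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

# XPath matching the markup as written: no <tbody>, header row counted as tr[1].
source_style = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

# lxml keeps the tree as written, so no <tbody> element exists at all.
has_tbody = bool(tree.xpath('//tbody'))

print(browser_style)                   # []
print([a.text for a in source_style])  # ['MMM']
print(has_tbody)                       # False
```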

Note that lxml can load URLs directly; you don't really need to use requests in this specific case:

>>> from lxml import html
>>> url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
>>> tree = html.parse(url)
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')
[<Element a at 0x10445e628>]
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].text
'MMM'
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].attrib['href']
'https://www.nyse.com/quote/XNYS:MMM'

If you wanted to extract all <a> elements in that first column, you'd have to remove the positional restriction on the <tr> element (the [1] in your expression, the [2] in the corrected one); without an index, the XPath selects every row:

links = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr/td[1]/a')
for link in links:
    print(link.text, link.attrib['href'])
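
If the same expression has to work whether or not the markup contains <tbody>, one option is to use the descendant axis (//tr) under the table instead of a direct child step. The sketch below is an illustration under assumptions: the two sample documents and the helper name first_column_links are made up for the example and are not part of the original answer.

```python
from lxml import html

# Hypothetical sample markup without <tbody>, as in the raw page source.
WITHOUT_TBODY = """<html><body><div id="mw-content-text">
<table>
<tr><th>Ticker symbol</th><th>Security</th></tr>
<tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td><td>3M Company</td></tr>
<tr><td><a href="https://www.nyse.com/quote/XNYS:ABT">ABT</a></td><td>Abbott Laboratories</td></tr>
</table>
</div></body></html>"""

# The same table with an explicit <tbody>, as a browser's DOM would show it.
WITH_TBODY = (WITHOUT_TBODY
              .replace("<table>", "<table><tbody>")
              .replace("</table>", "</tbody></table>"))


def first_column_links(document):
    """Return (text, href) pairs for links in the first column of the first table."""
    tree = html.fromstring(document)
    # '//tr' matches rows at any depth below the table, so an intermediate
    # <tbody> (present or absent) does not affect the result.
    links = tree.xpath('//*[@id="mw-content-text"]/table[1]//tr/td[1]/a')
    return [(link.text, link.get("href")) for link in links]


for text, href in first_column_links(WITHOUT_TBODY):
    print(text, href)
```

The header row drops out on its own because it contains <th> cells rather than <td> cells. Note that //tr would also match rows of any table nested inside the first one, which is acceptable here only because the sample has no nested tables.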
