Python 3.4 : LXML web scraping

I am using the following code to try to get a list of tickers from the Wikipedia page below, but the result is an empty list. I copied the XPath from the Chromium developer tools. What am I doing wrong?

from lxml import html
import requests


url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

resp = requests.get(url)
tree = html.fromstring(resp.text)

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

print(tickers)

Best answer, by Martijn Pieters:

Browsers insert elements that are missing from the source but that the HTML specification says are part of the document model. lxml does not add them.

The most common such element is <tbody>. The HTML served for this page has no <tbody>, but Chrome's DOM does, so Chrome includes it in the XPath it generates. Another such element is <thead>; again, the original HTML lacks it, but Chrome added it and moved the one <tr> row with <th> elements into it.

As such the correct XPath expression is:

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

That is, the second row in the table (the first row holds the <th> headers), and the first table cell in that row.
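
The difference can be reproduced offline. The following is a minimal sketch, assuming a small hand-written table that stands in for the Wikipedia markup; the SAMPLE string and its second row (ticker ZZZ on example.com) are made up for illustration and are not the real page content.

```python
from lxml import html

# Hand-written stand-in for the page: like the served HTML, it has no
# <tbody> or <thead>. The ZZZ row is made up for illustration.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <table>
    <tr><th>Ticker symbol</th><th>Security</th></tr>
    <tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td><td>3M Company</td></tr>
    <tr><td><a href="https://example.com/quote/ZZZ">ZZZ</a></td><td>Made-up Corp</td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# XPath as copied from the browser: it expects a <tbody> that lxml never sees.
browser_path = '//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a'
# XPath matching the markup as served: no <tbody>, and row 1 is the header row.
source_path = '//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a'

browser_hits = tree.xpath(browser_path)
source_hits = tree.xpath(source_path)

print(browser_hits)                   # nothing matches the browser's path
print([a.text for a in source_hits])  # the link in the second row
```

If the markup you are scraping does contain a literal <tbody>, the browser-style path would match, so it is worth checking the raw source (not the developer-tools DOM) before writing the expression.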

Note that lxml can load URLs directly; you don't really need to use requests in this specific case:

>>> from lxml import html
>>> url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
>>> tree = html.parse(url)
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')
[<Element a at 0x10445e628>]
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].text
'MMM'
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].attrib['href']
'https://www.nyse.com/quote/XNYS:MMM'
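
To try html.parse() without network access, you can hand it a file-like object instead of a URL. This is only a sketch under that assumption: the SAMPLE bytes below are a hand-written stand-in for the page, not the real Wikipedia markup.

```python
import io

from lxml import html

# Hand-written stand-in for the downloaded page (an assumption, not real data
# beyond the MMM row quoted above).
SAMPLE = b"""
<html><body>
<div id="mw-content-text">
  <table>
    <tr><th>Ticker symbol</th></tr>
    <tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td></tr>
  </table>
</div>
</body></html>
"""

# html.parse() takes a filename, URL, or file-like object and returns an
# ElementTree; html.fromstring() returns an element instead.
doc = html.parse(io.BytesIO(SAMPLE))
root = doc.getroot()

# XPath can be evaluated on the tree object directly.
first = doc.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0]
print(root.tag, first.text, first.attrib['href'])
```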

If you want to extract the <a> elements from the first column of every row, remove the positional restriction on the <tr> element; without the [2] index, the XPath selects all rows:

links = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr/td[1]/a')
for link in links:
    print(link.text, link.attrib['href'])
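
Here is a self-contained sketch of the same loop that runs without fetching the page. It assumes a hand-written SAMPLE table in place of the Wikipedia markup; the ZZZ row and its example.com link are made up for illustration.

```python
from lxml import html

# Hand-written stand-in for the page; the ZZZ row is made up for illustration.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <table>
    <tr><th>Ticker symbol</th><th>Security</th></tr>
    <tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td><td>3M Company</td></tr>
    <tr><td><a href="https://example.com/quote/ZZZ">ZZZ</a></td><td>Made-up Corp</td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# No index on tr, so every row is considered. The header row contains only
# <th> cells, so td[1] matches nothing there and it drops out on its own.
links = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr/td[1]/a')

# Collect (ticker, url) pairs; .get() returns None if href is missing.
tickers = [(link.text, link.get('href')) for link in links]
print(tickers)
```

Because td[1] only matches <td> cells, there is no need to skip the header row explicitly.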