I am using the following code to try to return a list of tickers from the website below. The result is an empty list. I copied the XPath from the Chromium developer tools. What am I doing wrong?
from lxml import html
import requests
url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
resp = requests.get(url)
tree = html.fromstring(resp.text)
tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')
print(tickers)
Browsers add in missing HTML elements that the HTML specification states are part of the model. lxml does not add those in.
The most common such element is the <tbody> element. Your document has no such element, but Chrome adds one and puts it in your XPath. Another such element is the <thead> element; again, the original HTML is lacking it, but Chrome puts it in and places the one <tr> row with <th> elements in it.
As such, the correct XPath expression leaves out the <tbody> step and selects the second <tr>, i.e. the second row in the table, first table cell in that row.
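The difference described above can be reproduced offline. This is only a sketch: the markup in SAMPLE is a made-up miniature of the page (a table with no <tbody>, one header row of <th> cells, then data rows), and the second XPath is what the description implies rather than a quoted expression.

```python
from lxml import html

# Hypothetical miniature of the page: no <tbody>/<thead> in the source,
# a header row of <th> cells followed by data rows.
SAMPLE = """<html><body><div id="mw-content-text">
<table>
  <tr><th>Ticker</th><th>Security</th></tr>
  <tr><td><a href="/mmm">MMM</a></td><td>3M</td></tr>
  <tr><td><a href="/abt">ABT</a></td><td>Abbott</td></tr>
</table>
</div></body></html>"""

tree = html.fromstring(SAMPLE)

# XPath as copied from the browser: it contains a tbody step that
# does not exist in the tree lxml builds from the raw HTML.
with_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

# Same path without tbody, targeting the second row
# (the first row only holds <th> cells).
without_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

print(with_tbody)                        # []
print([a.text for a in without_tbody])   # ['MMM']
```

Against the real page the row and table indexes may need adjusting if the markup has changed since the question was asked.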
Note that lxml can load URLs directly; you don't really need to use requests in this specific case.
If you wanted to extract all <a> elements in that first column, you'd have to remove the restriction on the <tr> element; your XPath picks a single row. Remove the [1] index on <tr> to select all rows.
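Both points can be sketched together. lxml.html.parse() accepts a filename, a URL, or a file-like object, so the same call works on the page URL or on local data. The helper name first_column_links and the SAMPLE markup are made up for this illustration, and the sample is parsed from memory so the sketch runs without network access.

```python
import io
from lxml import html

# Hypothetical stand-in for the downloaded page so the sketch runs offline.
SAMPLE = b"""<html><body><div id="mw-content-text">
<table>
  <tr><th>Ticker</th><th>Security</th></tr>
  <tr><td><a href="/mmm">MMM</a></td><td>3M</td></tr>
  <tr><td><a href="/abt">ABT</a></td><td>Abbott</td></tr>
</table>
</div></body></html>"""


def first_column_links(source):
    """Return the text of every <a> in the first column of the first table.

    `source` is anything lxml.html.parse() accepts: a filename, a URL,
    or a file-like object.
    """
    tree = html.parse(source)
    # No index on tr, so every row is considered. Header rows hold <th>
    # rather than <td> cells and therefore contribute nothing.
    links = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr/td[1]/a')
    return [a.text for a in links]


all_tickers = first_column_links(io.BytesIO(SAMPLE))
print(all_tickers)  # ['MMM', 'ABT']
```

With the real page you would pass the url string from the question instead of the BytesIO object; whether that works depends on the URL scheme lxml's underlying loader supports, so keeping requests is a reasonable fallback.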