I'm trying to scrape TD Asset Management pages (example below; I can't post more than two links) in order to retrieve the "price as on" value, i.e. the dollar amount in this snippet of HTML:
<div class="td-layout-grid9 td-layout-column td-layout-column-first">
Price As On: Jun 12, 2015
<br>
<strong>$14.54 </strong>
<strong class="td-copy-red">-0.01 (-0.07%)</strong>
</div>
I was hoping to achieve this with Python, requests, lxml, and XPath, which I installed as follows:
apt-get update
apt-get install python python-pip python-dev gcc build-essential libxml2-dev libxslt-dev libffi-dev libssl-dev
pip install lxml
pip install requests
pip install requests[security]
Next, to retrieve the page I did this:
python
>>> from lxml import html
>>> import requests
>>> page = requests.get('https://www.tdassetmanagement.com/fundDetails.form?fundId=6320&lang=en')
>>> tree = html.fromstring(page.text)
Finally, I tried to retrieve the desired dollar value using the XPath of the relevant element, as copied from Chrome's "Inspect Element" tool:
>>> price = tree.xpath('//*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1]')
>>> print price
Unfortunately the result is [<Element strong at 0x29a9998>] rather than the expected dollar amount, $14.54.
To ensure that the expected data was retrieved by the initial "requests.get", I ran this:
>>> print page.content
The result can be seen here: http://pastebin.com/f5C4MFQb.
If I paste the above HTML into this tool: http://videlibri.sourceforge.net/cgi-bin/xidelcgi my XPath query //*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1] returns the dollar amount as expected.
Any hints or tips as to how I might be able to use Python, lxml, and XPath to retrieve the desired value for this element would be very much appreciated. If there's a completely different way that I could be going about this to obtain the same result I would be interested in that too.
Thanks.
After further Googling to find out what elements are (they're lists of things with attributes like tag or text), followed by more Googling regarding a UnicodeEncodeError (see "UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)"), I was able to obtain my desired value with this:
Thanks for nudging me in the right direction, jonrsharpe.
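For reference, here is a minimal sketch of how the text can be read from the matched element rather than printing the element itself. It is an illustration under stated assumptions, not necessarily the exact code used: SAMPLE_PAGE is a hypothetical, cut-down stand-in for the real page (so the XPath is shorter than the one from Chrome), and the trailing &nbsp; is assumed from the u'\xa0' in the error message.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page, so the example runs offline.
# The real page is larger and its structure may differ.
SAMPLE_PAGE = u"""<html><body>
<div id="fundCardVO">
<div class="td-layout-grid9 td-layout-column td-layout-column-first">
Price As On: Jun 12, 2015
<br>
<strong>$14.54&nbsp;</strong>
<strong class="td-copy-red">-0.01 (-0.07%)</strong>
</div>
</div>
</body></html>"""

tree = html.fromstring(SAMPLE_PAGE)

# xpath() returns a list here, even when only one node matches.
matches = tree.xpath('//*[@id="fundCardVO"]/div[1]/strong[1]')
price_element = matches[0]

# The element carries its tag name, text content and HTML attributes.
price_tag = price_element.tag
price_text = price_element.text  # includes the trailing non-breaking space
price_attributes = dict(price_element.attrib)

# Alternatively, ask XPath for the text node directly.
price_texts = tree.xpath('//*[@id="fundCardVO"]/div[1]/strong[1]/text()')

# repr() sidesteps the UnicodeEncodeError that printing u'\xa0' can trigger
# on an ASCII-only terminal under Python 2.
print(repr(price_text))
```

Calling dir(price_element) in the interpreter lists everything the element object exposes, which is one way to discover attributes such as tag, text, tail and attrib.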
I still was not able to determine how to obtain a list of available attributes for the element though, but tag and text were available.
I went on to get just the number (without the dollar symbol and trailing non-breaking spaces) with this:
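A minimal sketch of such a clean-up, assuming the raw text looks like $14.54 followed by one or more non-breaking spaces. The clean_price helper is a hypothetical name used for illustration, and converting to Decimal is a design choice here, not something the original post specifies.

```python
from decimal import Decimal


def clean_price(raw_text):
    """Strip the dollar sign and surrounding (non-breaking) whitespace."""
    # u"\xa0" is the non-breaking space that &nbsp; decodes to.
    cleaned = raw_text.replace(u"\xa0", u" ").strip().lstrip(u"$")
    # Decimal avoids the rounding surprises of float for money values.
    return Decimal(cleaned)


price = clean_price(u"$14.54\xa0\xa0")
print(price)
```

If only the cleaned string is needed, the Decimal conversion can be dropped and the helper can return cleaned directly.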