How to extract text from multiple tags with XPath (lxml)?


Let's say I have HTML like this:

<table>
  <tr>
    <td colspan=2>Date</td>
  </tr>
  <tr id='something'>
   <td>8 september</td>
   <td>2008</td>
  </tr>
</table>

I want to extract the date as a single string: "8 september 2008".

There are 2 answers

Dimitre Novatchev (best answer)

A pure XPath 1.0 solution.

Use:

string(normalize-space(//table/tr[@id = 'something']))
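
An expression like this can be evaluated from lxml directly, because xpath() returns a string when the expression itself evaluates to a string. A minimal sketch, assuming the HTML from the question is parsed with lxml.html (the variable names content, doc and date are illustrative):

```python
import lxml.html as LH

# The HTML from the question.
content = '''
<table>
  <tr>
    <td colspan=2>Date</td>
  </tr>
  <tr id='something'>
   <td>8 september</td>
   <td>2008</td>
  </tr>
</table>
'''

doc = LH.fromstring(content)

# normalize-space() takes the string value of the first matching <tr>,
# strips leading/trailing whitespace and collapses inner runs to one space.
date = doc.xpath("string(normalize-space(//table/tr[@id = 'something']))")
print(date)
```

Since normalize-space() already returns a string, the outer string() call is harmless but could probably be dropped.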

unutbu

You could collect the text from each td element, and join them with ' '.join(...):

import lxml.html as LH

content = '''
<table>
  <tr>
    <td colspan=2>Date</td>
  </tr>
  <tr id='something'>
   <td>8 september</td>
   <td>2008</td>
  </tr>
</table>
'''

doc = LH.fromstring(content)
date = ' '.join(td.text for td in doc.xpath('//table/tr[@id = "something"]/td'))
print(date)

yields

8 september 2008

Or, if you can tolerate the newline and indentation that sit between the cells, you could use the text_content() method on the row:

# The XPath selects the <tr> row itself, so name the loop variable accordingly.
for tr in doc.xpath('//table/tr[@id = "something"]'):
    print(tr.text_content())

yields

8 september
   2008
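
If the extra whitespace is unwanted, one option is to collapse it in Python after calling text_content(). This is a sketch rather than part of the original answer; it relies only on str.split() with no arguments, which splits on any run of whitespace (the names rows and dates are illustrative):

```python
import lxml.html as LH

# The HTML from the question.
content = '''
<table>
  <tr>
    <td colspan=2>Date</td>
  </tr>
  <tr id='something'>
   <td>8 september</td>
   <td>2008</td>
  </tr>
</table>
'''

doc = LH.fromstring(content)
rows = doc.xpath('//table/tr[@id = "something"]')

# split() drops newlines and indentation; ' '.join() puts single spaces back.
dates = [' '.join(tr.text_content().split()) for tr in rows]
print(dates[0])
```

Unlike the td.text approach, this should also pick up text inside nested elements such as a <b> within a cell.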