XPath: extract the trailing text of a node


I have an HTML file with the following content.

...
<table><tbody>
...
            <tr>
              <td><span class="myclass">C</span>
                <a href="/myurl" title="myclick">mytext</a>
                   tailing text
              </td>
            </tr>
...
</tbody></table>
...

I would like to extract the information and write it to a TSV file in the following format.

C<TAB>mytext<T>tailing text

So far, I have only managed to work out the following XPath expression, which extracts the first two columns. Could anybody show me how to extract the third column? Thanks.

xidel -s -e '//table/tbody/tr/td/join((span, a), x:cps(9))' - < infile.html

There are 2 answers

Martin Honnen

If you use

//table/tbody/tr/td/string-join(node()[normalize-space()], x:cps(9))

you get three columns, but the last one may contain whitespace before and after the text. To avoid whitespace that your desired result doesn't show, normalize each node as well:

//table/tbody/tr/td/string-join(node()[normalize-space()]/normalize-space(), x:cps(9))
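To see why filtering on normalize-space() leaves exactly three columns, here is a rough Python sketch of the same idea using only the standard library. It is not what xidel does internally. It assumes the cell can be parsed as well-formed XML, and the helper names child_node_strings and normalize_space are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The <td> from the question, as a well-formed fragment (assumption:
# real-world HTML may need a more forgiving parser than ElementTree).
CELL = """<td><span class="myclass">C</span>
                <a href="/myurl" title="myclick">mytext</a>
                   tailing text
              </td>"""


def child_node_strings(elem):
    """Yield the string value of each child node (text or element) in document order."""
    if elem.text is not None:
        yield elem.text
    for child in elem:
        # String value of an element: all of its descendant text joined.
        yield "".join(child.itertext())
        # ElementTree stores the text that follows an element in .tail.
        if child.tail is not None:
            yield child.tail


def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(s.split())


td = ET.fromstring(CELL)
# Keep only nodes that are non-empty after normalization, then normalize them.
columns = [normalize_space(s) for s in child_node_strings(td) if normalize_space(s)]
line = "\t".join(columns)
print(line)
```

The whitespace-only text node between the span and the a element is dropped by the filter, so only the span, the link and the trailing text remain.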

zx485

You can use this command:

xidel infile.html --xpath '//table/tbody/tr/td/string-join((span, "<TAB>", a, "<T>", a/following::text()[1]))'

or

xidel --xpath '//table/tbody/tr/td/string-join((span, "<TAB>", a, "<T>", a/following::text()[1]))' - < infile.html

Another approach is

xidel infile.html --xpath '//table/tbody/tr/td/concat(span, "<TAB>", a, "<T>", a/following-sibling::text()[1])' 

In all three cases, the output is:

C<TAB>mytext<T>tailing text
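For comparison, the text node selected by a/following-sibling::text()[1] corresponds to what Python's ElementTree calls the tail of the a element. The sketch below is only an illustration outside xidel. It assumes the table is well-formed XML, it assumes the placeholders in the question stand for real tab characters, and the function name rows is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Table from the question with the elided parts left out (assumption:
# the markup is well-formed enough for an XML parser).
HTML = """<table><tbody>
            <tr>
              <td><span class="myclass">C</span>
                <a href="/myurl" title="myclick">mytext</a>
                   tailing text
              </td>
            </tr>
</tbody></table>"""


def rows(root):
    """Yield [span text, link text, trailing text] for each matching cell."""
    for td in root.findall("./tbody/tr/td"):
        span = td.find("span")
        a = td.find("a")
        if span is None or a is None:
            continue  # skip cells that do not have the expected shape
        # a.tail is the text that directly follows </a>; strip the
        # surrounding indentation and newlines.
        yield [span.text or "", a.text or "", (a.tail or "").strip()]


root = ET.fromstring(HTML)
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows(root))
tsv = buf.getvalue()
print(tsv, end="")
```

To write a file instead of printing, the same writer can be pointed at a handle opened with open("out.tsv", "w", newline=""), where out.tsv is a placeholder name.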