Insert element[s] (<td>) into each row of HTML Table using XMLStarlet via Bash

625 views Asked by At

I want to extract each html table from a list of links. The code I use is the following:

wget -O - "https://example.com/section-1/table-name/financial-data/" | xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null >> /Applications/parser/output.txt

This works perfectly fine, however, given that this is not the only table I want to extract it will give me difficulties identifying which financial-data belongs to which table. In this case scenario, it will only parse one table that is appended to that output file where the SDTOUT looks like this:

<tbody>

                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                    </tr>
...
</tbody>

But I am looking for this:

<tbody>

                    <tr class="text-right">
                      <td>TABLE-NAME</td>
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                    </tr>

                    <tr class="text-right">
                      <td>TABLE-NAME</td>
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                    </tr>
...
</tbody>

Where the TABLE-NAME is the name of the specific asset. The name can be extracted either using the XPath /html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text() which appears in the same URL where the table is, or from the link itself /table-name/.

I cannot figure out the syntax.

NB: I purposely omitted the -q flag in the wget command as I want to see what is happening in the Terminal at the moment the script is executed.

Thanks!


UPDATE

According to @DanielHaley this can be done through XMLStarlet, however, when I read through the documentation I could not find an example of how to use it.

What is the correct syntax? Do I first have to parse the HTML table via xmllint --html --xpath and then apply xmlstarlet afterwards?

This is what I've found so far:

-i or --insert <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-a or --append <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>

NEW UPDATE

According to this link, I came across the script that adds a subnode easily like this:

wget -O - "https://example.com/section-1/table-name/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "Hello World" >> /Applications/parser/output.txt

Which writes the following to STDOUT:

<tbody>

                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                      <td>Hello World</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                      <td>Hello World</td>
                    </tr>
...
</tbody>

So far so good, however, this reproduces some default text declared as a text string using the option -v, i.e. in this case scenario "Hello World". I'm hoping to replace this text string with the actual name of the asset. As stated previously, the TABLE-NAME is found in the same page where the table is and can be accessed via the other XPath, hence I tried the following code:

wget -O - "https://example.com/section-1/table-name/financial-data/" |
header=$(xmllint --html --xpath '/html/body/div[3]/div/div[1]/div[3]/div[1]/h1' -) |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt

Here you can clearly see that I tried declaring a variable $header that shall include the name of the asset. This does not work and leaves my output file empty, probably because the declaration is wrong or the pipe's syntax is not correct.

How can I insert the according XPath (that references to the name of the asset) into the newly created subnode <td>? A variable is the first thing that I came up with; can it be done elsewise?

3

There are 3 answers

1
Ava Barbilla On BEST ANSWER

This script works but is inefficient; it needs some editing:

name_query="html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text()"

# Use xargs to TRIM result.
header=$(wget -O - "https://example.com/section-1/name-1/financial-data/" |
    xmllint --html --xpath "$name_query" - 2>/dev/null |
    xargs)

wget -O - "https://example.com/section-1/name-1/financial-data/" |
    xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
    xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt

This makes two requests:

  1. Fetch the name and pass it to variable $header
  2. Get the table and append a subnode <td>$header</td>

Hence, this writes the following to my output.txt file:

<tbody>

                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>     
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                      <td>Name 1</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>     
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                      <td>Name 1</td>
                    </tr>
...
</tbody>

It's relatively slow because this can actually be done using one request only, but I can't figure out how.

3
Walter A On

You should try to insert the additional column before appending the output to output.txt. Make sure the tablename you need is stored in a variable. You want to do something like

tbl=testtbl
echo "<tbody>
                    <tr class="text-right">
                      <td class="text-left">Sep 08, 2017</td>
                      <td>4605.16</td>
                      <td>4661.00</td>
                      <td>4075.18</td>
                      <td>4228.75</td>
                      <td>2,700,890,000</td>
                      <td>76,220,200,000</td>
                    </tr>

                    <tr class="text-right">
                      <td class="text-left">Sep 07, 2017</td>
                      <td>4589.14</td>
                      <td>4655.04</td>
                      <td>4491.33</td>
                      <td>4599.88</td>
                      <td>1,844,620,000</td>
                      <td>75,945,000,000</td>
                    </tr>
" | sed 's#.*<tr.*#&\n      <td>'"${tbl}"'</td>#'

In the sed command the normal slashes are replaced by '#', so you do not to escape the slash in </td>.
When you have a file alltables.txt with the apporox. 1160 tables, you van make a loop like this:

while IFS= read -r tbl; do
   wget -O - "https://example.com/section-1/table-name/financial-data/" |
      xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
      sed 's#.*<tr.*#&\n      <td>'"${tbl}"'</td>#' >> /Applications/parser/output.txt
done < alltables.txt
2
Daniel Haley On

You could probably do this with the ed (edit) command in xmlstarlet, but I don't know xmlstarlet well enough to give you an easy answer.

Also, like you said, it looks like you'd have to pass the HTML through either xmllint or use the fo xmlstarlet command before passing it to xmlstarlet ed. It doesn't look like ed supports --html.

What I would do is use the xmlstarlet tr (transform) command with an XSLT stylesheet.

It's very verbose, but it's much safer than trying to parse HTML/XML with regex. It's also a lot easier to extend.

Here's the XSLT. I added comments to try to help you understand what's happening.

XSLT 1.0 (stylesheet.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>

  <!--Parameter to capture the table name. This is set on the command line.-->
  <xsl:param name="tablename"/>

  <!--Identity transform. Will basically output attributes/nodes without 
  change if not matched by a more specific template.-->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!--Template matching the root element. I do this to narrow the scope of what's 
  being processed.-->
  <xsl:template match="/*">
    <!--Process tbody.-->
    <xsl:apply-templates select=".//*[@id='financial-data']/div/table/tbody"/>
  </xsl:template>

  <!--Match tr elements so we can add the new td with the table name.-->
  <xsl:template match="tr">
    <!--Output the tr element.-->
    <xsl:copy>
      <!--Process any attributes.-->
      <xsl:apply-templates select="@*"/>
      <!--Create new td element.-->
      <td><xsl:value-of select="$tablename"/></td>
      <!--Process any children of tr.-->
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

command line

wget -O - "https://example.com/section-1/table-name/financial-data/" | 
xml tr --html stylesheet.xsl -p tablename="/html/body/div[3]/div/div[1]/div[3]/div[1]/h1"

I was able to test this locally by using cat on a local html file instead of wget. Let me know if you want me to add the test file/result to my answer.