I want to extract the HTML table from each page in a list of links. The code I use is the following:
wget -O - "https://example.com/section-1/table-name/financial-data/" | xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null >> /Applications/parser/output.txt
This works fine. However, this is not the only table I want to extract, so I will have trouble identifying which financial data belongs to which table. In this case, only one table is parsed and appended to the output file, and STDOUT looks like this:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
...
</tbody>
But I am looking for this:
<tbody>
<tr class="text-right">
<td>TABLE-NAME</td>
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td>TABLE-NAME</td>
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
...
</tbody>
Where TABLE-NAME is the name of the specific asset. The name can be extracted either with the XPath /html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text(), which matches on the same page as the table, or from the link itself (/table-name/).
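For the second option, taking the name from the link itself, a minimal sketch with plain shell parameter expansion might look like this. It assumes every URL ends in /financial-data/ and that the asset name is the path segment directly before it; the variable names are only illustrative.

```shell
url="https://example.com/section-1/table-name/financial-data/"

# Drop the trailing "/financial-data/" segment (assumed to end every URL).
trimmed=${url%/financial-data/}

# Keep only what follows the last remaining "/".
name=${trimmed##*/}

printf '%s\n' "$name"
```

This avoids a second XPath query entirely, at the cost of getting the URL slug rather than the display name shown in the page's h1.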
I cannot figure out the syntax.
NB: I purposely omitted the -q flag in the wget command because I want to see what is happening in the Terminal while the script runs.
Thanks!
UPDATE
According to @DanielHaley, this can be done with XMLStarlet. However, when I read through the documentation I could not find an example of how to use it.
What is the correct syntax? Do I first have to parse the HTML table via xmllint --html --xpath and then apply xmlstarlet afterwards?
This is what I've found so far:
-i or --insert <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-a or --append <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
NEW UPDATE
Following this link, I came across a script that adds a subnode easily, like this:
wget -O - "https://example.com/section-1/table-name/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "Hello World" >> /Applications/parser/output.txt
Which writes the following to STDOUT:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
<td>Hello World</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
<td>Hello World</td>
</tr>
...
</tbody>
So far so good. However, this inserts a fixed text string passed via the -v option, in this case "Hello World". I'm hoping to replace this text string with the actual name of the asset. As stated previously, the TABLE-NAME is found on the same page as the table and can be accessed via the other XPath, hence I tried the following code:
wget -O - "https://example.com/section-1/table-name/financial-data/" |
header=$(xmllint --html --xpath '/html/body/div[3]/div/div[1]/div[3]/div[1]/h1' -) |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt
Here I tried to declare a variable $header that should hold the name of the asset. This does not work and leaves my output file empty, probably because the declaration is wrong or the pipe syntax is incorrect.
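A likely explanation, shown as a minimal sketch rather than a diagnosis of the exact script: each stage of a pipeline runs in its own subshell, so a variable assigned in the middle of a pipe is lost when that stage ends, and the assignment writes nothing to the next stage. Here printf stands in for wget and cat stands in for xmllint; both substitutions are assumptions made to keep the example runnable offline.

```shell
# Stand-in for the downloaded page; printf replaces wget here.
unset header
printf 'Asset Name\n' | header=$(cat) | cat    # the assignment passes nothing on
first=${header:-}                              # remember what the pipeline left behind
echo "inside pipeline: '$first'"

# Assigning outside the pipeline keeps the value in the current shell.
header=$(printf 'Asset Name\n' | cat)
echo "outside pipeline: '$header'"
```

The middle stage consumes all of its input and emits nothing, which would also explain why the last command in the original pipe receives no data and the output file stays empty.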
How can I insert the result of the corresponding XPath (the one that references the name of the asset) into the newly created subnode <td>? A variable is the first thing I came up with; can it be done another way?
This script works but is inefficient; it needs some editing:
This makes two requests: one to fill $header, and one to fetch the table rows that receive <td>$header</td>.
Hence, this writes the following to my output.txt file:
It's relatively slow because this can actually be done using one request only, but I can't figure out how.
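One way this might be reduced to a single request, offered as a sketch under assumptions rather than a tested solution for the real site: download the page once into a shell variable, then run both XPath queries against that saved copy. The function name tag_rows is hypothetical, the XPaths are the ones quoted above, and --insert on td[1] is used instead of --subnode so that the name lands in the first cell, as in the desired output.

```shell
#!/bin/sh
# Sketch only: tag_rows is a hypothetical helper, not part of any tool.
tag_rows() {
    # $1: the full HTML of the page, already downloaded once
    page=$1

    # Query 1: the asset name as plain text (normalize-space trims the <h1> content).
    header=$(printf '%s' "$page" |
        xmllint --html --xpath 'normalize-space(/html/body/div[3]/div/div[1]/div[3]/div[1]/h1)' - 2>/dev/null)

    # Query 2: the table body, with a new <td> inserted before the first cell of each row.
    printf '%s' "$page" |
        xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
        xmlstarlet ed --insert '/tbody/tr/td[1]' --type elem -n td -v "$header"
}

# Usage, one request per URL:
# page=$(wget -O - "https://example.com/section-1/table-name/financial-data/")
# tag_rows "$page" >> /Applications/parser/output.txt
```

Because the assignment to page happens outside any pipeline, the value stays available for both queries. If the pages are large, writing the download to a temporary file and passing the file name to xmllint would serve the same purpose.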