Parse an HTML table with Nokogiri in Ruby

I have an HTML table that looks like the following:

<table id="TTdata" border="0" cellspacing="0" cellpadding="3" align="center">
   <tbody>
      <tr class="TTdata_ltblue">
         <td class="ctr"><b>#</b></td>
         <td class="ctr"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=YEAR">YEAR</a><img src="/images/up.gif"></b></td>
         <td class="ctr" title="Player's name."><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=NAME">NAME</a></b></td>
         <td class="ctr" title="how many pitches a catcher had a chance/need to frame"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=FR_CHANCES">FR_CHANCES</a></b></td>
         <td class="ctr" title="the number of strikes the catcher is expected to have received according to RPM"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=PREDICTED_STRIKES">PREDICTED_STRIKES</a></b></td>
         <td class="ctr" title="the number of strikes the catcher actually received"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=ACTUAL_STRIKES">ACTUAL_STRIKES</a></b></td>
         <td class="ctr" title="the difference between actual and predicted strikes received by the catcher"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=EXTRA_STRIKES">EXTRA_STRIKES</a></b></td>
         <td class="ctr" title="runs RPM credits to the catcher, using the ball-strike context to calculated run value"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=FR_RUNS_ADDED_BY_COUNT">FR_RUNS_ADDED_BY_COUNT</a><img src="/images/down.gif"></b></td>
         <td class="ctr" title="how many runs RPM would assign using a generic .14 runs available per frame"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=FR_RUNS_ADDED_BY_CALL">FR_RUNS_ADDED_BY_CALL</a></b></td>
         <td class="ctr" title="pitches the catcher received that could have resulted in a wild pitch or passed ball; this is when runners are on base or a dropped third strike is possible"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=BL_CHANCES">BL_CHANCES</a></b></td>
         <td class="ctr"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=PREDICTED_PBWP">PREDICTED_PBWP</a></b></td>
         <td class="ctr" title="the run value accumulated from preventing wild pitches and passed balls (.28 per PB/WP saved)"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=BL_RUNS_ADDED">BL_RUNS_ADDED</a></b></td>
         <td class="ctr" title="the number of passed balls and wild pitches allowed by the catcher"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=ACTUAL_PBWP">ACTUAL_PBWP</a></b></td>
         <td class="ctr" title="the difference between actual and predicted passed balls and wild pitches allowed by the catcher
            "><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=PBWP_SAVED">PBWP_SAVED</a></b></td>
      </tr>
      <tr class="TTdata">
         <td>1.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Yasmani+Grandal" target="_blank">Yasmani Grandal</a></td>
         <td class="right">2295</td>
         <td class="right">871.5</td>
         <td class="right">925</td>
         <td class="right">53.5</td>
         <td class="right">8.0</td>
         <td class="right">8.0</td>
         <td class="right">1097</td>
         <td class="right">18.0</td>
         <td class="right">0.0</td>
         <td class="right">18</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>2.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Buster+Posey" target="_blank">Buster Posey</a></td>
         <td class="right">2601</td>
         <td class="right">1,011.4</td>
         <td class="right">1,056</td>
         <td class="right">44.6</td>
         <td class="right">6.6</td>
         <td class="right">6.6</td>
         <td class="right">1232</td>
         <td class="right">10.0</td>
         <td class="right">0.0</td>
         <td class="right">10</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>3.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Francisco+Cervelli" target="_blank">Francisco Cervelli</a></td>
         <td class="right">2629</td>
         <td class="right">989.0</td>
         <td class="right">1,033</td>
         <td class="right">44.0</td>
         <td class="right">6.5</td>
         <td class="right">6.5</td>
         <td class="right">1357</td>
         <td class="right">14.0</td>
         <td class="right">0.0</td>
         <td class="right">14</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>4.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Mike+Zunino" target="_blank">Mike Zunino</a></td>
         <td class="right">2828</td>
         <td class="right">1,128.8</td>
         <td class="right">1,169</td>
         <td class="right">40.2</td>
         <td class="right">6.0</td>
         <td class="right">6.0</td>
         <td class="right">1325</td>
         <td class="right">19.0</td>
         <td class="right">0.0</td>
         <td class="right">19</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>5.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Caleb+Joseph" target="_blank">Caleb Joseph</a></td>
         <td class="right">2713</td>
         <td class="right">993.9</td>
         <td class="right">1,031</td>
         <td class="right">37.1</td>
         <td class="right">5.5</td>
         <td class="right">5.5</td>
         <td class="right">1315</td>
         <td class="right">9.0</td>
         <td class="right">0.0</td>
         <td class="right">9</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>6.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Chris+Iannetta" target="_blank">Chris Iannetta</a></td>
         <td class="right">2158</td>
         <td class="right">847.5</td>
         <td class="right">884</td>
         <td class="right">36.5</td>
         <td class="right">5.4</td>
         <td class="right">5.4</td>
         <td class="right">1078</td>
         <td class="right">15.0</td>
         <td class="right">0.0</td>
         <td class="right">15</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>7.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Jason+Castro" target="_blank">Jason Castro</a></td>
         <td class="right">2679</td>
         <td class="right">1,068.9</td>
         <td class="right">1,105</td>
         <td class="right">36.1</td>
         <td class="right">5.4</td>
         <td class="right">5.4</td>
         <td class="right">1378</td>
         <td class="right">18.0</td>
         <td class="right">0.0</td>
         <td class="right">18</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>8.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Miguel+Montero" target="_blank">Miguel Montero</a></td>
         <td class="right">1977</td>
         <td class="right">785.8</td>
         <td class="right">820</td>
         <td class="right">34.2</td>
         <td class="right">5.1</td>
         <td class="right">5.1</td>
         <td class="right">972</td>
         <td class="right">11.0</td>
         <td class="right">0.0</td>
         <td class="right">11</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>9.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Martin+Maldonado" target="_blank">Martin Maldonado</a></td>
         <td class="right">2343</td>
         <td class="right">906.0</td>
         <td class="right">940</td>
         <td class="right">34.0</td>
         <td class="right">5.1</td>
         <td class="right">5.1</td>
         <td class="right">1193</td>
         <td class="right">17.0</td>
         <td class="right">0.0</td>
         <td class="right">17</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>10.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Tyler+Flowers" target="_blank">Tyler Flowers</a></td>
         <td class="right">2191</td>
         <td class="right">833.4</td>
         <td class="right">865</td>
         <td class="right">31.6</td>
         <td class="right">4.7</td>
         <td class="right">4.7</td>
         <td class="right">1305</td>
         <td class="right">13.0</td>
         <td class="right">0.0</td>
         <td class="right">13</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>11.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Rene+Rivera" target="_blank">Rene Rivera</a></td>
         <td class="right">2632</td>
         <td class="right">1,043.1</td>
         <td class="right">1,070</td>
         <td class="right">26.9</td>
         <td class="right">4.0</td>
         <td class="right">4.0</td>
         <td class="right">1331</td>
         <td class="right">18.0</td>
         <td class="right">0.0</td>
         <td class="right">18</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>12.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Russell+Martin" target="_blank">Russell Martin</a></td>
         <td class="right">2919</td>
         <td class="right">1,121.3</td>
         <td class="right">1,148</td>
         <td class="right">26.7</td>
         <td class="right">4.0</td>
         <td class="right">4.0</td>
         <td class="right">1470</td>
         <td class="right">27.0</td>
         <td class="right">0.0</td>
         <td class="right">27</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>13.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Kevin+Plawecki" target="_blank">Kevin Plawecki</a></td>
         <td class="right">1826</td>
         <td class="right">744.0</td>
         <td class="right">770</td>
         <td class="right">26.0</td>
         <td class="right">3.9</td>
         <td class="right">3.9</td>
         <td class="right">886</td>
         <td class="right">9.0</td>
         <td class="right">0.0</td>
         <td class="right">9</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>14.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=David+Ross" target="_blank">David Ross</a></td>
         <td class="right">941</td>
         <td class="right">339.6</td>
         <td class="right">361</td>
         <td class="right">21.4</td>
         <td class="right">3.2</td>
         <td class="right">3.2</td>
         <td class="right">519</td>
         <td class="right">5.0</td>
         <td class="right">0.0</td>
         <td class="right">5</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>15.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Roberto+Perez" target="_blank">Roberto Perez</a></td>
         <td class="right">1969</td>
         <td class="right">776.5</td>
         <td class="right">789</td>
         <td class="right">12.5</td>
         <td class="right">1.9</td>
         <td class="right">1.9</td>
         <td class="right">1090</td>
         <td class="right">12.0</td>
         <td class="right">0.0</td>
         <td class="right">12</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>16.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Welington+Castillo" target="_blank">Welington Castillo</a></td>
         <td class="right">1047</td>
         <td class="right">410.6</td>
         <td class="right">420</td>
         <td class="right">9.4</td>
         <td class="right">1.4</td>
         <td class="right">1.4</td>
         <td class="right">499</td>
         <td class="right">4.0</td>
         <td class="right">0.0</td>
         <td class="right">4</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>17.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Hank+Conger" target="_blank">Hank Conger</a></td>
         <td class="right">1000</td>
         <td class="right">405.2</td>
         <td class="right">414</td>
         <td class="right">8.8</td>
         <td class="right">1.3</td>
         <td class="right">1.3</td>
         <td class="right">511</td>
         <td class="right">4.0</td>
         <td class="right">0.0</td>
         <td class="right">4</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata_ltgrey">
         <td>18.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Josh+Thole" target="_blank">Josh Thole</a></td>
         <td class="right">476</td>
         <td class="right">168.8</td>
         <td class="right">177</td>
         <td class="right">8.2</td>
         <td class="right">1.2</td>
         <td class="right">1.2</td>
         <td class="right">275</td>
         <td class="right">4.0</td>
         <td class="right">0.0</td>
         <td class="right">4</td>
         <td class="right">0.0</td>
      </tr>
      <tr class="TTdata">
         <td>19.</td>
         <td class="right">2015</td>
         <td><a href="/player_search.php?search_name=Tucker+Barnhart" target="_blank">Tucker Barnhart</a></td>
         <td class="right">934</td>
         <td class="right">351.4</td>
         <td class="right">357</td>
         <td class="right">5.6</td>
         <td class="right">0.8</td>
         <td class="right">0.8</td>
         <td class="right">410</td>
         <td class="right">4.0</td>
         <td class="right">0.0</td>
         <td class="right">4</td>
         <td class="right">0.0</td>
      </tr>
   </tbody>
</table>

In this case, I'm interested in retrieving every "player" that is in a table row with either the class of TTdata or TTdata_ltgrey. This can be achieved using the following:

require 'open-uri'
require 'nokogiri'

html = URI.open(url)
doc = Nokogiri::HTML(html)

doc.css('.TTdata, .TTdata_ltgrey').each do |catcher|
   # parse here
end

My problem is that none of the td entries have a class that identifies what they hold. I only know them by order: the first td is a position, the second is a year, and the third is a name.

What's the right way to access each td using the iteration above so I can create a model/hash of name/val pairs for each row?

Answer by Arup Rakshit (best answer)

Here is one approach I tried; you can extend it from here to suit your needs:

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML.parse(File.read("#{__dir__}/out1.html"))

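data = doc.css('.TTdata, .TTdata_ltgrey').map do |tr|
  # pair the first three cells of each row with the keys
  %i(position year name).zip(tr.css("td:nth-child(-n+3)").map(&:text)).to_h
end

pp data

Output:

[{:position=>"1.", :year=>"2015", :name=>"Yasmani Grandal"},
 {:position=>"2.", :year=>"2015", :name=>"Buster Posey"},
 {:position=>"3.", :year=>"2015", :name=>"Francisco Cervelli"},
 {:position=>"4.", :year=>"2015", :name=>"Mike Zunino"},
 {:position=>"5.", :year=>"2015", :name=>"Caleb Joseph"},
 {:position=>"6.", :year=>"2015", :name=>"Chris Iannetta"},
 {:position=>"7.", :year=>"2015", :name=>"Jason Castro"},
 {:position=>"8.", :year=>"2015", :name=>"Miguel Montero"},
 {:position=>"9.", :year=>"2015", :name=>"Martin Maldonado"},
 {:position=>"10.", :year=>"2015", :name=>"Tyler Flowers"},
 {:position=>"11.", :year=>"2015", :name=>"Rene Rivera"},
 {:position=>"12.", :year=>"2015", :name=>"Russell Martin"},
 {:position=>"13.", :year=>"2015", :name=>"Kevin Plawecki"},
 {:position=>"14.", :year=>"2015", :name=>"David Ross"},
 {:position=>"15.", :year=>"2015", :name=>"Roberto Perez"},
 {:position=>"16.", :year=>"2015", :name=>"Welington Castillo"},
 {:position=>"17.", :year=>"2015", :name=>"Hank Conger"},
 {:position=>"18.", :year=>"2015", :name=>"Josh Thole"},
 {:position=>"19.", :year=>"2015", :name=>"Tucker Barnhart"}]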