Use rvest and a CSS selector to extract a table from scraped search results


I just learned about rvest from Hadley's great webinar and am trying it out for the first time.

I want to scrape (and then plot) the baseball standings table returned from a Google search result.

My problem is that rvest does not return the table I see in my browser plug-in.


library(rvest)
library(magrittr) # for %>% operator

( g_search <- html_session(url = "http://www.google.com/?q=mlb+standings", 
                           httr::add_headers("user-agent" = "Mozilla/5.0")) )
# <session> http://www.google.com/?q=mlb+standings
#   Status: 200
#   Type:   text/html; charset=UTF-8
#   Size:   52500

This search should return a page with a table buried under many layers but uniquely identified by <div class="tb_strip">. A quick stop at the CSS Diner teaches me (I think) that "div.tb_strip" is a valid CSS selector to capture this table (and possibly other junk). And indeed, using Firebug's CSS selector, I see the full path:

# Use Firebug "Copy CSS Path" and paste into table_path
table_path <- "html body#gsr.srp.tbo.vasq div#main div#cnt.big div.mw div#rcnt div.col div#center_col div#res.med div#search div div#ires ol#rso li.g.tpo.knavi.obcontainer div.kp-blk div#uid_0.r-iCGI_bFBahQE.xpdbox.xpdopen div div.lr_container.mod div#lr_tab_unit_uid_1.tb_u.r-igQv_rxlT08k div.tb_view div.tb_strip"

However, the following attempt to access this table fails due to html_nodes returning an empty list.

( standings <- g_search %>% 
    html_nodes("div.tb_strip") %>% 
    html_table() 
  ) #returns empty list

The content does not seem to be making it into g_search, so I don't know yet whether the CSS selector worked.

grep("tb_strip",html_text(read_html("http://www.google.com/?q=mlb+standings")) ) # empty
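One caveat about this check: html_text() returns only the text content of the page, so a class name such as tb_strip would not show up in it even if the element were present. A minimal sketch of searching the serialized markup instead, assuming a small hand-written page (the HTML below is hypothetical, not the real Google response):

```r
library(rvest)

# Hypothetical inline page standing in for the downloaded search result
page <- read_html('<html><body><div class="tb_strip"><p>standings</p></div></body></html>')

# html_text() drops tags and attributes, so the class name is not in the text
in_text <- grepl("tb_strip", html_text(page))

# as.character() serializes the full markup, attributes included
in_markup <- grepl("tb_strip", as.character(page))

print(c(in_text = in_text, in_markup = in_markup))
# in_text is FALSE, in_markup is TRUE
```

If the markup search is also empty on the real page, the element was never in the downloaded HTML, which would point to content added later by JavaScript in the browser.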

Where did it go?

TYVM


Answer by cory (best answer):

Here's an example from an easier site...

library("rvest")
url <- "http://sports.yahoo.com/mlb/standings/"
# read_html() replaces the deprecated html()
read_html(url) %>%
  html_nodes(".yui3-tabview-content") %>% html_nodes("table") %>%
  html_table()
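
Because the live page may change or be unavailable, the same chain of selectors can be tried offline. This is only a sketch on a made-up fragment: the class name mirrors the selector above, while the table contents and team names are invented for illustration.

```r
library(rvest)

# Made-up fragment; only the class name is taken from the selector above
fragment <- '<div class="yui3-tabview-content">
  <table>
    <tr><th>Team</th><th>W</th><th>L</th></tr>
    <tr><td>Team A</td><td>10</td><td>5</td></tr>
    <tr><td>Team B</td><td>8</td><td>7</td></tr>
  </table>
</div>'

standings <- read_html(fragment) %>%
  html_nodes(".yui3-tabview-content") %>%  # container matched by class
  html_nodes("table") %>%                  # every table inside the container
  html_table()                             # one data frame per table

# html_table() on a node set returns a list, so take the first element
standings[[1]]
```

If html_nodes() matches nothing, html_table() returns an empty list, which is the symptom described in the question.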