use rvest and css selector to extract table from scraped search results

Question

1.6k views Asked by C8H10N4O2 At 15 June 2015 at 16:43

I just learned about rvest from Hadley's great webinar and am trying it out for the first time.

I want to scrape (and then plot) the baseball standings table returned from a Google search result.

My problem is that I cannot get rvest to return the table I see in my browser plug-in.

library(rvest)
library(httr)     # for add_headers()
library(magrittr) # for %>% operator

( g_search <- html_session(url = "http://www.google.com/?q=mlb+standings",
                           add_headers("user-agent" = "Mozilla/5.0")) )
# <session> http://www.google.com/?q=mlb+standings
#   Status: 200
#   Type:   text/html; charset=UTF-8
#   Size:   52500

This search should return a page with a table buried under many layers but uniquely identified by <div class="tb_strip">. A quick stop at the CSS Diner teaches me (I think) that "div.tb_strip" is a valid CSS selector to capture this table (and possibly other junk). And indeed, using Firebug's CSS selector, I see the full path:

# Use Firebug "Copy CSS Path" and paste into table_path
table_path <- "html body#gsr.srp.tbo.vasq div#main div#cnt.big div.mw div#rcnt div.col div#center_col div#res.med div#search div div#ires ol#rso li.g.tpo.knavi.obcontainer div.kp-blk div#uid_0.r-iCGI_bFBahQE.xpdbox.xpdopen div div.lr_container.mod div#lr_tab_unit_uid_1.tb_u.r-igQv_rxlT08k div.tb_view div.tb_strip"

However, the following attempt to access this table fails due to html_nodes returning an empty list.

( standings <- g_search %>% 
    html_nodes("div.tb_strip") %>% 
    html_table() 
  ) #returns empty list
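
One way to separate "the selector is wrong" from "the content never arrived" is to run the same selector against a small hand-written page. The snippet below is only a sketch: the fixture markup, team name, and numbers are made up for illustration and are not Google's actual HTML. It also selects the table inside the div before calling html_table(), on the assumption that some rvest versions expect a table node there.

```r
library(rvest)

# Hypothetical fixture: a table wrapped in <div class="tb_strip">
fixture <- read_html('<html><body><div class="mod"><div class="tb_strip"><table>
  <tr><th>Team</th><th>W</th><th>L</th></tr>
  <tr><td>Team A</td><td>10</td><td>5</td></tr>
</table></div></div></body></html>')

# If the selector is valid and the node exists, this is non-empty
nodes <- html_nodes(fixture, "div.tb_strip")
n_matches <- length(nodes)

# Narrow to the <table> itself, then parse it
tables <- html_nodes(fixture, "div.tb_strip table")
standings <- html_table(tables)[[1]]
```

If n_matches is greater than zero here but zero on the fetched page, the selector is probably fine and the node is missing from the downloaded HTML.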

The content does not seem to be making it into g_search, so I don't know yet whether the CSS selector worked.

grep("tb_strip",html_text(read_html("http://www.google.com/?q=mlb+standings")) ) # empty
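
A caveat about this kind of check: html_text() returns only the text content of the page, so a class name such as tb_strip would not appear in it even when the div is present. A minimal sketch of the difference, using a hand-written snippet (hypothetical markup, not the real search page) and as.character() to serialize the parsed document back to markup:

```r
library(rvest)

# Hypothetical stand-in for a fetched page
page <- read_html('<html><body><div class="tb_strip"><p>standings</p></div></body></html>')

# Text nodes only: attribute values such as class names are dropped
in_text <- grepl("tb_strip", html_text(page), fixed = TRUE)

# Serialized markup: tags and attributes are included
in_markup <- grepl("tb_strip", as.character(page), fixed = TRUE)
```

Here in_text is FALSE and in_markup is TRUE, so searching the serialized markup is the more reliable way to tell whether the div was downloaded at all.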

Where did it go?

TYVM

There is 1 answer

**cory** · Accepted Answer · 2015-06-15T17:16:09+00:00

Here's an example from an easier site...

library("rvest")
url <- "http://sports.yahoo.com/mlb/standings/"
# read_html() replaces the deprecated html()
read_html(url) %>%
  html_nodes(".yui3-tabview-content") %>%
  html_nodes("table") %>%
  html_table()
