I would like to extract the first 100 results (say) of a Google Scholar search using R. Does anyone know how to do it?
To be precise, I just need the title of each paper, its authors and its citation count.
P.S. Would this be legal?
I can't speak to the legalities of your task, but there are a few ways you can go about this. I am not strong in XPath, but it might be the best way: I believe you can use the XML package to retrieve the page contents and then use XPath to extract the data from the elements you need.
For instance, I use Chrome as my browser, and when I inspected the page with Developer Tools, there did appear to be a structure to the page, with the data "hidden" inside various tags that you should be able to exploit really easily using XPath.
Check out this link for an example of using XPath.
HTH and good luck.
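To make the XPath idea concrete, here is a minimal sketch using the XML package. It parses a small, made-up HTML snippet rather than a live page, so it runs offline. The class names `gs_r`, `gs_rt`, `gs_a` and `gs_fl` are assumptions about how a results page is marked up, and the helper `first_text` is hypothetical; check the real page in your browser's developer tools and adjust the paths.

```r
library(XML)

# Hypothetical, simplified stand-in for one page of search results.
# The class names are assumptions, not a documented page layout.
html <- '
<html><body>
  <div class="gs_r">
    <h3 class="gs_rt"><a href="#">A first example paper</a></h3>
    <div class="gs_a">A Author, B Author - Example Journal, 2010</div>
    <div class="gs_fl"><a href="#">Cited by 42</a> <a href="#">Related articles</a></div>
  </div>
  <div class="gs_r">
    <h3 class="gs_rt"><a href="#">A second example paper</a></h3>
    <div class="gs_a">C Author - Another Journal, 2011</div>
    <div class="gs_fl"><a href="#">Related articles</a></div>
  </div>
</body></html>'

# Parse the HTML string and select one node per search result
doc <- htmlParse(html, asText = TRUE)
entries <- getNodeSet(doc, "//div[@class='gs_r']")

# Text of the first node matching a relative XPath, or NA if nothing matches
first_text <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

title_text  <- sapply(entries, first_text, path = ".//h3[@class='gs_rt']")
author_line <- sapply(entries, first_text, path = ".//div[@class='gs_a']")
cited_text  <- sapply(entries, first_text,
                      path = ".//a[starts-with(normalize-space(.), 'Cited by')]")

results <- data.frame(
  title     = title_text,
  # Keep only the part before " - ", which holds the author names here
  authors   = sub(" - .*$", "", author_line),
  # Strip the "Cited by " prefix; entries without such a link stay NA
  citations = as.integer(sub("Cited by ", "", cited_text)),
  stringsAsFactors = FALSE
)
```

Working per result node (rather than running one XPath over the whole page for each field) keeps the columns aligned when a paper has no "Cited by" link.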
There are some Python and Perl scrapers out there that you might be able to adapt, linked at http://bmb-common.blogspot.com/2011/02/does-google-scholar-suck-or-am-i-just.html
You can definitely retrieve the HTML content of the page using RCurl and parse it using the XML package, as suggested by Btibert3. The only issue you might face is that Google won't allow you to run queries in a "robotic" way. After something like 200 queries in a short period of time, it stops returning results. Maybe that's different with Google Scholar, but I doubt it...
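A rough sketch of the retrieval side with RCurl, spacing requests out so they look less "robotic". The URL layout (`q` and `start` parameters, 10 results per page), the example query, the 5-second default delay and the function names `scholar_url` and `fetch_pages` are all assumptions for illustration. Nothing here guarantees you won't be blocked, and it doesn't settle whether automated access is allowed.

```r
library(RCurl)

# Build the URL for one page of results (assumed URL layout)
scholar_url <- function(query, start = 0) {
  paste0("https://scholar.google.com/scholar?q=", curlEscape(query),
         "&start=", start)
}

# Offsets for the first 100 results, assuming 10 results per page
starts <- seq(0, 90, by = 10)
urls <- vapply(starts,
               function(s) scholar_url("species distribution model", s),
               character(1))

# Fetch each page in turn, pausing between requests
fetch_pages <- function(urls, delay = 5) {
  pages <- character(length(urls))
  for (i in seq_along(urls)) {
    pages[i] <- getURL(urls[i], followlocation = TRUE)
    Sys.sleep(delay)  # wait before sending the next request
  }
  pages
}
```

Calling `fetch_pages(urls)` would return one HTML string per page, each of which can then be parsed with `htmlParse(page, asText = TRUE)`.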
A solution was recently published here:
http://thebiobucket.blogspot.com/2011/11/visually-examine-google-scholar-search.html
Please also see the updated biobucket post:
http://thebiobucket.blogspot.com/2011/11/r-function-google-scholar-webscraper.html