How do I scrape a site with multiple pages and create a single HTML page with Ruby?


So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/ and create one HTML page that I can either print or send to my Kindle.

I am thinking of using Hpricot, but am not too sure how to proceed.

How do I set it up so that it recursively checks each link, gets the HTML, either stores it in a variable or dumps it into the main HTML page, then goes back to the table of contents and repeats for the next link?

You don't have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.

Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html and manually programme the script to extract text between certain tags (e.g. h3, p, etc.)?

If I do that approach, then I will have to look at each individual source for each chapter/article and then do that. Kinda defeats the purpose of writing a script to do it, no?

Ideally I would like a script that can tell the difference between JS and other code and the actual 'text', and just dump the text (formatted with the proper headings and such).

Would really appreciate some guidance.

Thanks.

1 Answer

Tilo (best answer):

I'd recommend using Nokogiri instead of Hpricot. It's more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.

I did some extensive scraping for work at one point, and had to switch to Nokogiri because Hpricot would crash inexplicably on some pages.
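For what it's worth, since every post on a Blogger site is rendered from the same template, you only have to inspect the ugly source once: one pair of CSS selectors should match the title and body of every chapter, which gets around the "look at each individual source" problem. Below is a minimal sketch of the whole crawl-and-assemble loop with Nokogiri and open-uri. The selectors ('.post-title', '.post-body') and the link filter are assumptions based on typical Blogger markup, so check them against the real page and adjust; on Rubies older than 2.5, use open(url) instead of URI.open(url).

    require 'open-uri'
    require 'nokogiri'

    TOC_URL = 'http://boxerbiography.blogspot.com/'

    # Fetch and parse the table of contents page.
    toc = Nokogiri::HTML(URI.open(TOC_URL))

    # Collect links that look like individual post pages.
    # (Assumption: chapter links appear somewhere on the front page and
    # end in .html, which is how Blogger structures post permalinks.)
    chapter_urls = toc.css('a').map { |a| a['href'] }.compact
                      .select { |href| href.start_with?('http://boxerbiography.blogspot.com') && href.end_with?('.html') }
                      .uniq

    # Visit each chapter and keep only the title and body, so scripts,
    # sidebars, and other page chrome are dropped automatically.
    chapters = chapter_urls.map do |url|
      page  = Nokogiri::HTML(URI.open(url))
      title = page.at_css('.post-title')  # assumed Blogger selector
      body  = page.at_css('.post-body')   # assumed Blogger selector
      next unless title && body
      "<h3>#{title.text.strip}</h3>\n#{body.inner_html}"
    end.compact

    # Dump everything into one printable HTML file.
    File.write('book.html', <<~HTML)
      <html><head><meta charset="utf-8"><title>Boxer Biography</title></head>
      <body>#{chapters.join("\n<hr/>\n")}</body></html>
    HTML

Because only the nodes you explicitly select get copied out, the JS, sidebars, and other clutter never make it into book.html, which is exactly the "just the text" behaviour you're after.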

Check this RailsCast:

http://railscasts.com/episodes/190-screen-scraping-with-nokogiri

and:

http://nokogiri.org/

http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html

http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/