How can I screen-scrape the HTML result of a non-trivial user scenario


I want to be able to get the HTML for a page which, if I were doing it interactively in a browser, would involve multiple actions and page loads:

1. Go to the homepage.
2. Enter text into a login form and submit the form (POST).
3. The POST goes through various redirections and frameset usage.

Cookies are set and updated throughout this process.

In the browser, after submitting, I just get the page.

But to do this with curl (in PHP or otherwise), wget, or any other low-level tool, managing the cookies, redirections and framesets becomes quite a chore, and it binds my script very tightly to the website, making it susceptible to even small changes in the site I'm scraping.

Can anyone suggest a way to do this?

I've already looked at Crowbar and PhantomJS and Lynx (with cmd_log/cmd_script options) but chaining everything together to mimic exactly what I'd do in Firefox or Chrome is difficult.

(As an aside, it might even be useful or necessary for the target website to think this script is Firefox, Chrome, or some other "real" browser.)


There are 3 answers

Patrice Neff (best answer)

One way to do this is using Selenium RC. While it's usually used for testing, at its core it's just a browser remote-control service.

Use this web site as a starting point: http://seleniumhq.org/projects/remote-control/
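To give a feel for the approach, here is a rough sketch using the Selenium WebDriver Python bindings (the modern successor to Selenium RC). The form field names are hypothetical, and it assumes the `selenium` package plus a browser driver (e.g. geckodriver for Firefox) are installed:

```python
def fetch_after_login(homepage_url, username, password):
    """Drive a real browser through the login flow and return the final HTML."""
    # Imports are inside the function; the `selenium` package is an assumption.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(homepage_url)
        # "username"/"password" are hypothetical -- inspect the real form.
        driver.find_element(By.NAME, "username").send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.NAME, "password").submit()
        # The browser handles cookies, redirects and framesets itself;
        # page_source is the HTML of whatever page we end up on.
        return driver.page_source
    finally:
        driver.quit()
```

Because a real browser is doing the work, the cookie and redirect chore from the question disappears, and the site sees a genuine Firefox user agent.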

seagulf

You can use iRobot from IRobotSoft to record a robot and replay it.

If you prefer low-level control, you can use the HTQL Python interface (see http://htql.net/htql-python-manual.pdf), which lets you drive an IE-based browser from Python.

hoju

Use a tool like Firebug to check what headers are submitted to the website on login, and then replicate them exactly in your code.

Or just login with your browser and then reuse the cookie in your code.
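That approach can be sketched with the standard library as well -- the cookie value here is a made-up placeholder; copy the real `Cookie` header from your browser's tools after logging in manually:

```python
import urllib.request

# Placeholder: paste the Cookie header your browser sends after login.
SESSION_COOKIE = "PHPSESSID=abc123; remember=1"

def make_request(url, cookie=SESSION_COOKIE):
    # Attach the browser's session cookie so the site treats us as logged in.
    req = urllib.request.Request(url)
    req.add_header("Cookie", cookie)
    return req

def fetch(url, cookie=SESSION_COOKIE):
    # Hypothetical helper: download a page as the already-logged-in user.
    return urllib.request.urlopen(make_request(url, cookie)).read().decode()
```

This skips the login flow entirely, but the script stops working once the browser session expires.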