Representation of a web page as the browser sees it


I have some ideas for building a more intelligent web spider, one that interacts with a web page and extracts information in a way closer to how we humans do.

To do this, I need a representation of a web page that is similar or identical to what we see in our browsers.

In other words I need access to the data concerning the location, colour and style of all the elements on the page, possibly at a pixel level.

But I don't want just a rendered bitmap: I also want to be able to extract text, click links, push buttons and so on.

I get the feeling the DOM may be a starting point, but more concrete advice would be appreciated.

To clarify: I want programmatic access to web pages in a form similar to what a browser presents, so that I can, for example, check the colour or text at a specific pixel location or region.


There is 1 answer

Christopher Creutzig

You might want to check out Selenium (or other ways of scripting your browser, such as Greasemonkey). Because how a web page is displayed depends heavily on the particular browser, scripting a real browser is the most precise way of getting at what the user sees.
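
As a rough illustration of the Selenium route, here is a minimal sketch using the Python bindings. It assumes Selenium 4 with Firefox and geckodriver installed. The names `ElementBox`, `parse_css_color`, `element_at`, `snapshot` and `click_at` are made up for this example and are not part of Selenium; the CSS selector is just a guess at which elements you care about.

```python
import re
from dataclasses import dataclass


@dataclass
class ElementBox:
    """Position, size, text and computed colour of one rendered element."""
    tag: str
    text: str
    x: float
    y: float
    width: float
    height: float
    color: str  # computed CSS colour string, e.g. "rgb(0, 0, 0)"

    def contains(self, px, py):
        """True if the page coordinate (px, py) falls inside this box."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def parse_css_color(value):
    """Turn 'rgb(r, g, b)' or 'rgba(r, g, b, a)' into (r, g, b), else None."""
    m = re.match(r"rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", value)
    return tuple(int(g) for g in m.groups()) if m else None


def element_at(boxes, px, py):
    """Return the smallest box containing the point, or None.

    Picking the smallest area is only an approximation of 'innermost
    element'; it ignores z-order and overlapping siblings.
    """
    hits = [b for b in boxes if b.contains(px, py)]
    return min(hits, key=lambda b: b.width * b.height, default=None)


def snapshot(url, selector="a, button, p, h1, h2, h3"):
    """Load url in a real browser and record the geometry of matching elements."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        boxes = []
        for el in driver.find_elements(By.CSS_SELECTOR, selector):
            r = el.rect  # dict with keys x, y, width, height
            boxes.append(ElementBox(el.tag_name, el.text,
                                    r["x"], r["y"], r["width"], r["height"],
                                    el.value_of_css_property("color")))
        return boxes
    finally:
        driver.quit()


def click_at(driver, px, py):
    """Click whatever the browser itself reports at viewport point (px, py)."""
    el = driver.execute_script(
        "return document.elementFromPoint(arguments[0], arguments[1]);",
        px, py)
    if el is not None:
        el.click()
    return el
```

Something like `boxes = snapshot("https://example.com")` followed by `element_at(boxes, 120, 80)` would then give the text and colour of the element under that point. For true pixel-level colours you would probably still need a screenshot (for instance via `driver.save_screenshot`), since the computed `color` property describes the text colour rather than the rendered pixels.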