What is the best way to grab website data (content)?


I need to grab data (content) from some websites. These websites provide listings; I need to grab those listings and filter them according to their content.

Is there any software that can do this? A PHP script? If not, where can I start programming this functionality myself?

There are 3 answers

0
Damien MATHIEU On

There's no magic solution, because every page's content is different.
Since you mention PHP, I'll give you some pointers for that language.

You can fetch a web page using cURL.
After getting the content, you can parse it using regular expressions.
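
A minimal sketch of those two steps, assuming the PHP cURL extension is installed. The function names `fetch_page` and `extract_titles`, the URL in the usage comment, and the `<h2 class="title">` markup are all made up for illustration; the pattern has to be adapted to the real page.

```php
<?php
// Fetch a page with cURL; returns the body as a string, or null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Extract the text of every <h2 class="title"> element (hypothetical markup).
function extract_titles(string $html): array
{
    preg_match_all('/<h2 class="title">(.*?)<\/h2>/s', $html, $matches);
    return array_map('trim', $matches[1]);
}

// Usage (hypothetical URL):
// $html = fetch_page('http://example.com/listings');
// if ($html !== null) {
//     print_r(extract_titles($html));
// }
```

Keeping the fetch and the parse in separate functions means the pattern can be tried against a saved copy of the page without hitting the site each time.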

Depending on what you want to do, you'll have to develop the application yourself.

0
PurplePilot On

Use file_get_contents(), which returns the whole file as a string, then parse the string to extract the content.
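
A rough sketch of that approach. The parsing step here uses PHP's DOM extension rather than manual string handling, which is one possible choice and not something the answer prescribes. The function names `load_html` and `extract_links` are hypothetical, and fetching a URL this way assumes `allow_url_fopen` is enabled.

```php
<?php
// Load a page (URL or local path) into a string; null if it cannot be read.
function load_html(string $source): ?string
{
    $html = @file_get_contents($source);
    return $html === false ? null : $html;
}

// Return [href => link text] for every <a> element in the HTML.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

// Usage (hypothetical URL):
// $html = load_html('http://example.com/listings');
// if ($html !== null) {
//     print_r(extract_links($html));
// }
```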

Other options would be cURL or wget, which fetch the whole file so that you can then process it with tools such as awk, sed, or Perl.

It depends on how often you need to scrape the target page. If only occasionally, then PHP is fine, but you will need to trigger it yourself, for example from a browser, and remember that regular expressions in PHP can be time-consuming.

If you want to scrape the file on a regular basis, then a Bash script using cURL/wget plus sed and awk can be run from cron, without intervention and in the background.

0
TigerTiger On

If it's PHP, maybe this will help you: http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial

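// get the HTML
$html = file_get_contents("http://www.thefutureoftheweb.com/blog/");
if ($html === false) {
    die("Could not fetch the page\n");
}

// capture link, title, date and content for each post on the page
preg_match_all(
    '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
    $html,
    $posts, // will contain the blog posts
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];    // first capture group: the href
    $title = $post[2];   // second capture group: the link text
    $date = $post[3];    // third capture group: the date span
    $content = $post[4]; // fourth capture group: the section div

    // do something with data
}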

Of course, you'll need to customise the regular expression depending upon your requirements.
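
As an illustration of that kind of customisation, here is a sketch that adapts the pattern to a listings page and then filters the results by content, which is what the question asks for. The markup (`<div class="listing">` with an `<h2>` title and a `<span class="price">`), the function names, and the filter criteria are all assumptions made up for this example.

```php
<?php
// Pull title and price out of each listing block (hypothetical markup).
function parse_listings(string $html): array
{
    $pattern = '/<div class="listing">.*?<h2>(?<title>.*?)<\/h2>.*?'
             . '<span class="price">(?<price>\d+)<\/span>.*?<\/div>/s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $listings = [];
    foreach ($matches as $m) {
        $listings[] = ['title' => trim($m['title']), 'price' => (int) $m['price']];
    }
    return $listings;
}

// Keep listings whose title contains $keyword (case-insensitive)
// and whose price is at most $maxPrice.
function filter_listings(array $listings, string $keyword, int $maxPrice): array
{
    return array_values(array_filter(
        $listings,
        fn(array $l) => stripos($l['title'], $keyword) !== false && $l['price'] <= $maxPrice
    ));
}

// Usage, with $html fetched as in the example above:
// $cheapBikes = filter_listings(parse_listings($html), 'bike', 200);
```

Separating parsing from filtering keeps the fragile, site-specific regular expression in one place, so the filter rules can change without touching it.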

You can also find loads of other examples: http://www.google.com/search?source=ig&hl=en&rlz=&=&q=php+web+scraper&aq=f&oq=&aqi=