How do I gather data from other websites for an app?


I'm trying to build a news hub app. My goal is to extract news articles from other news outlets, summarize them, and present them as unbiased bullet points. The summarization algorithm is already up and running; all I need is the code to gather data from other websites such as NDTV, CNN, etc. Please describe how to carry this out.

Code, links, examples, and screenshots would help a lot. Thanks! (Y)


There are 2 answers

0
Alireza Sanaee On

Web scraping is the way to go. You can get news articles, or anything else you need, with Scrapy, Beautiful Soup, or Selenium. These are Python modules for extracting data (text) from HTML pages; afterwards you can save the data wherever you want, such as a database. For headlines and similar items, it's better to use the sites' RSS feeds where they are available.
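To illustrate the RSS suggestion, here is a minimal sketch that uses only the Python standard library. It assumes the site publishes a plain RSS 2.0 feed; the function names `parse_rss` and `fetch_rss`, the `User-Agent` string, and the sample feed are all made up for this example, and real feeds may need extra handling (namespaces, Atom format, encodings).

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return a list of {'title', 'link', 'published'} dicts from RSS 2.0 text."""
    root = ET.fromstring(xml_text)
    articles = []
    # In RSS 2.0 every article is an <item> element inside <channel>.
    for item in root.iter("item"):
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return articles


def fetch_rss(url, timeout=10):
    """Download a feed and parse it. The caller supplies the feed URL."""
    # Hypothetical User-Agent; some servers reject requests without one.
    request = urllib.request.Request(url, headers={"User-Agent": "news-hub-demo/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_rss(response.read())


# Made-up feed used to show the parsing step without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/news/1</link>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/news/2</link>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for article in parse_rss(SAMPLE_FEED):
        print(article["title"], "->", article["link"])
```

For a live site you would call `fetch_rss` with the feed URL the publisher advertises, then pass each `link` to an HTML scraper only if you need the full article body.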

0
unifreak On

There is a PHP library called QueryList (http://git.oschina.net/jae/QueryList). It uses phpQuery internally and takes an array of CSS selector rules to fetch specific content from a given URL.

The documentation is in Chinese (I don't think there is an English version), but the library is quite simple to use:

<?php
// include the lib
require_once('QueryList.class.php');

// url to fetch content
$url = 'http://www.example.com/index.html';

// filter rules using css selector grammar
$regArr = array(
    'time' => array('td:nth-child(2)', 'text'),
    'summary' => array('td:nth-child(3) td:nth-child(3)', 'text'),
    'imgSrc' => array('h1 > a > img', 'src')
    );

// optional: first find every `.divbox > table`, then apply the rules in $regArr inside each match
$regRange = '.divbox > table';

// do the query
$result = QueryList::Query($url, $regArr, $regRange);

// the result will be an array like:
/** Array
 * (
 *    [0] => Array
 *    (
 *        'time' => ,
 *        'summary' => ,
 *        'imgSrc' =>
 *    )
 *    [1] => Array
 *    (
 *        'time' => ,
 *        'summary' => ,
 *        'imgSrc' =>
 *    )
 *    ...
 * )
 */
echo '<pre>';
print_r($result->jsonArr);
echo '</pre>';

You can also define an exclude pattern and a callback function in $regArr. I think this will meet your requirements.