Means of getting data for a given website from the Web Data Commons?


I'm trying to find interesting data inside the Web Data Commons dumps. It is taking a day to grep across them on my machine (even in parallel). Is there an index of which websites are covered, and a way to extract data specifically from those sites?


1 Answer

Answered by Chris:

To get all of the pages from a particular domain, one option is to query the Common Crawl index API:

http://index.commoncrawl.org

To list all of the pages from a specific domain, for example wikipedia.org:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org*/&showNumPages=true

This tells you how many pages of index results Common Crawl has for this domain (note that you can use wildcards, as in this example).
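As a minimal sketch in Python (assuming the requests package is installed; the crawl ID and URL pattern are just the ones from the example above), you can ask the index for the page count like this:

    import requests

    INDEX = "http://index.commoncrawl.org/CC-MAIN-2015-11-index"

    # Ask the index how many pages of results exist for the wildcard pattern.
    resp = requests.get(INDEX, params={
        "url": "*.wikipedia.org*/",
        "showNumPages": "true",
    })
    resp.raise_for_status()
    print(resp.text)  # a small JSON object containing the page count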

Then request each page, and Common Crawl will return one JSON record per line describing each capture:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=en.wikipedia.org/*&page=0&output=json

You can then parse the JSON and locate each WARC file through the filename field (together with the offset and length fields, which let you fetch just that record).
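As a rough sketch (again assuming requests; the data host prefix below is Common Crawl's publicly documented download location and may need adjusting), you could parse one page of results and pull a single record out of its WARC file with a byte-range request:

    import json
    import requests

    INDEX = "http://index.commoncrawl.org/CC-MAIN-2015-11-index"
    DATA = "https://data.commoncrawl.org/"  # prefix prepended to each record's filename

    # One page of index results: one JSON record per line.
    resp = requests.get(INDEX, params={
        "url": "en.wikipedia.org/*",
        "page": "0",
        "output": "json",
    })
    resp.raise_for_status()
    records = [json.loads(line) for line in resp.text.splitlines() if line.strip()]

    # Fetch only the bytes of the first record from its (gzipped) WARC file.
    rec = records[0]
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1
    warc = requests.get(DATA + rec["filename"],
                        headers={"Range": "bytes=%d-%d" % (start, end)})
    warc.raise_for_status()
    with open("record.warc.gz", "wb") as fh:
        fh.write(warc.content)

Each range-fetched chunk should be decompressible on its own, since the WARC files are built from individually gzipped records, so you never have to download a whole multi-gigabyte file to get one page.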
