I'm trying to find interesting data inside the Web Data Commons dumps. It is taking days to grep across them on my machine (even in parallel). Is there an index of which websites are covered, and a way to extract data specifically for those sites?
Means of getting data for a given website from the Web Data Commons?
510 views · Asked by user1556658
To get all of the pages from a particular domain, one option is to query the Common Crawl index API:
http://index.commoncrawl.org
To list all of the pages from a specific domain such as wikipedia.org, query the index with a URL pattern (wildcards such as `*.wikipedia.org` are supported). Asking for the number of result pages first tells you how many pages of blocks Common Crawl has for that domain.
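A minimal sketch of that page-count query in Python. The collection name `CC-MAIN-2023-50` is just an example crawl, not something from the original answer; the current list of collections is published at http://index.commoncrawl.org/collinfo.json.

```python
import json
import urllib.parse
import urllib.request

# Example collection; substitute any crawl listed at
# http://index.commoncrawl.org/collinfo.json
INDEX = "http://index.commoncrawl.org/CC-MAIN-2023-50-index"


def build_query(pattern, **params):
    """Build an index-API query URL for a URL pattern (wildcards allowed)."""
    qs = urllib.parse.urlencode({"url": pattern, "output": "json", **params})
    return f"{INDEX}?{qs}"


def num_pages(pattern):
    """Ask the index how many pages of results match the pattern."""
    with urllib.request.urlopen(build_query(pattern, showNumPages="true")) as r:
        return json.loads(r.read())["pages"]


# Usage (makes a network request):
#   num_pages("*.wikipedia.org")
```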
Then request each result page in turn, asking Common Crawl to return one JSON object per captured file.
You can then parse the JSON and find the WARC file containing each capture through the `filename` field.
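A sketch of fetching one result page and locating each record's WARC file, again assuming the example collection name `CC-MAIN-2023-50`. Each line of the response is a separate JSON object; besides `filename`, the `offset` and `length` fields give the record's byte range inside that WARC file, so an HTTP Range request against data.commoncrawl.org can retrieve just the one record instead of the whole file.

```python
import json
import urllib.parse
import urllib.request

# Example collection; substitute any crawl listed at
# http://index.commoncrawl.org/collinfo.json
INDEX = "http://index.commoncrawl.org/CC-MAIN-2023-50-index"


def fetch_page(pattern, page):
    """Return the index records on one result page as a list of dicts."""
    qs = urllib.parse.urlencode({"url": pattern, "output": "json", "page": page})
    with urllib.request.urlopen(f"{INDEX}?{qs}") as r:
        # The response is newline-delimited JSON, one record per line.
        return [json.loads(line) for line in r.read().decode().splitlines()]


def warc_range_request(record):
    """Build a Range request for just this record's bytes in the WARC file."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    req = urllib.request.Request(
        "https://data.commoncrawl.org/" + record["filename"]
    )
    req.add_header("Range", f"bytes={start}-{end}")
    return req


# Usage (makes network requests):
#   for rec in fetch_page("*.wikipedia.org", 0):
#       print(rec["filename"])
```

The retrieved byte range is a gzip member containing the single WARC record, so it can be decompressed on its own.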
This link will help you.