How to download a subset of Amazon Common Crawl (only the text (WET files?) is needed)

Question


329 views · Asked by UriCS on 17 December 2014 at 20:09

For research purposes, I want a large set (~100K) of web pages, though I am only interested in their text. I plan to use them to train a gensim LDA topic model. Common Crawl seems like a good place to start, but I am not sure how to go about it. Could someone explain how to download 100K text files, or how to access them directly (if that is easier than downloading them)?

Original Q&A

There is 1 answer

**UriCS** · Accepted Answer · 2014-12-17T21:42:53+00:00

It seems it is possible to download only part of the dataset (you can select just the monthly crawl you want), and you can download only the extracted text (the WET files). For example, you can download the August 2014 crawl data from: http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/ and an explanation of the file format can be found here: http://blog.commoncrawl.org/2014/04/navigating-the-warc-file-format/
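To make that a bit more concrete, here is a rough sketch of how the text could be pulled out of a WET file with only the Python standard library. It assumes each record follows the WARC layout described in the second link (a `WARC/...` version line, `Name: value` headers, a blank line, then `Content-Length` bytes of text) and that page text sits in records whose `WARC-Type` is `conversion`. The function names `iter_wet_records`, `page_texts` and `open_wet` are made up for this example, not part of any Common Crawl tooling.

```python
import gzip
import urllib.request
from typing import BinaryIO, Dict, Iterator, List, Tuple


def iter_wet_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (headers, text) for each record in an uncompressed WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers: Dict[str, str] = {}
        while True:
            header_line = stream.readline().strip()
            if not header_line:
                break  # a blank line ends the header block
            name, _, value = header_line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly the payload, so body lines are never parsed as headers.
        length = int(headers.get("Content-Length", "0"))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def page_texts(stream: BinaryIO, limit: int) -> List[str]:
    """Collect the text of up to `limit` pages (assumed: 'conversion' records)."""
    texts: List[str] = []
    for headers, text in iter_wet_records(stream):
        if headers.get("WARC-Type") == "conversion":
            texts.append(text)
            if len(texts) >= limit:
                break
    return texts


def open_wet(url: str) -> BinaryIO:
    """Open a remote gzipped WET file as a decompressed byte stream.

    `url` is supplied by the caller; this helper does not know where
    Common Crawl hosts its files.
    """
    response = urllib.request.urlopen(url)
    return gzip.GzipFile(fileobj=response)
```

With something like this, `page_texts(open_wet(wet_url), limit=1000)` would return the text of the first 1000 pages of one file, where `wet_url` is a placeholder for a WET file address taken from the file listing in the crawl announcement. Since the stream is read sequentially and the loop stops at `limit`, collecting ~100K pages should only need a few files rather than a whole crawl.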
