How to download a subset of Amazon Common Crawl (only the text (WET files?) is needed)


For research purposes, I want a large set (~100K) of web pages, though I am only interested in their text. I plan to use them to train a gensim LDA topic model. Common Crawl seems like a good place to start, but I am not sure how to go about it. Could someone point out how to download 100K text files, or how to access them remotely if that is easier than downloading them?


There is 1 answer

UriCS On BEST ANSWER

It seems you can download just part of the dataset (you can simply select the month you want), and you can download only the text (the files called WET files). For example, you can download the August 2014 crawl data from: http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/. An explanation of the file format can be found here: http://blog.commoncrawl.org/2014/04/navigating-the-warc-file-format/
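
Once a WET file has been downloaded, pulling the page text out of it could look roughly like the sketch below. It is a minimal illustration under some assumptions, not an official Common Crawl tool: it assumes the file is a gzip-compressed sequence of WARC records, that the extracted-text records are marked with `WARC-Type: conversion`, and that header names are spelled exactly as shown. The function names (`iter_wet_records`, `iter_page_texts`, `first_pages`) are made up for this example, and only the Python standard library is used.

```python
import gzip
import io
from itertools import islice


def iter_wet_records(stream):
    """Yield (headers, text) for each WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or EOF) ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length counts bytes, so read bytes before decoding.
        length = int(headers.get("Content-Length", 0))
        body = stream.read(length)
        yield headers, body.decode("utf-8", "replace")


def iter_page_texts(stream):
    """Yield (url, text) for records that carry extracted page text."""
    for headers, text in iter_wet_records(stream):
        # Assumption: text records use WARC-Type "conversion"; other
        # record types (e.g. "warcinfo") are metadata and are skipped.
        if headers.get("WARC-Type") == "conversion":
            yield headers.get("WARC-Target-URI"), text


def first_pages(path, limit):
    """Return up to `limit` (url, text) pairs from a local .wet.gz file."""
    with gzip.open(path, "rb") as stream:
        return list(islice(iter_page_texts(stream), limit))
```

Calling `first_pages(path, 100000)` on a downloaded file (the path is whatever you saved it as) returns a list of `(url, text)` pairs; if one file holds fewer pages than you need, repeat over more files from the same crawl. The texts can then be tokenized and passed to gensim in the usual way.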