Extracting specific articles and their talk pages from a Wikipedia dump


I am completely new to web crawling. I have the following Wikipedia dump link: https://dumps.wikimedia.org/backup-index.html. I have a list of article titles, all in English.

I need to download those articles and their talk pages from the given dumps. Kindly let me know where to start.

1 Answer

Answered by Martin Urbanec

That depends a lot on your use case. Do you have a relatively small set of pages to fetch (say, a few hundred)? If so, go with the API: it can give you both wikitext and HTML, while the dumps will only give you wikitext.
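As a minimal sketch of the API route in Python (assuming the `requests` library; the `fetch_wikitext` helper, the User-Agent string, and the example titles are my own, not from the question):

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
# Wikimedia asks API clients to send an identifying User-Agent.
HEADERS = {"User-Agent": "article-fetcher/0.1 (contact: you@example.com)"}

def fetch_wikitext(title):
    """Return the current wikitext of `title`, or None if the page is missing."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    resp = requests.get(API_URL, params=params, headers=HEADERS)
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    if page.get("missing"):
        return None
    return page["revisions"][0]["slots"]["main"]["content"]

titles = ["Alan Turing", "Ada Lovelace"]  # substitute your own list
for t in titles:
    article = fetch_wikitext(t)
    talk = fetch_wikitext("Talk:" + t)  # talk pages live in the Talk: namespace
```

The talk page of any article is just the same title with the `Talk:` prefix, so one helper covers both.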

If you need to use the dumps, or just want to learn how best to work with them, https://en.wikipedia.org/wiki/Wikipedia:Database_download#How_to_use_multistream? may be good study material.
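The point of the multistream dump is that it ships with an index file mapping each page title to the byte offset of a small bz2 block containing it, so you can seek straight to the pages you want instead of decompressing the whole dump. A rough sketch in Python, assuming you have downloaded a matching dump/index pair (the filenames below and the `find_offset`/`read_stream` helpers are illustrative, not from the answer):

```python
import bz2

# Assumed local filenames; download the matching pair from dumps.wikimedia.org.
DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"

def find_offset(title):
    """Scan the index for the byte offset of the bz2 stream holding `title`.
    Each index line has the form 'offset:page_id:title'."""
    with bz2.open(INDEX, "rt", encoding="utf-8") as f:
        for line in f:
            offset, _page_id, name = line.rstrip("\n").split(":", 2)
            if name == title:
                return int(offset)
    return None

def read_stream(offset):
    """Decompress the single bz2 stream starting at `offset`; it contains
    a block of up to ~100 raw <page> XML elements."""
    dec = bz2.BZ2Decompressor()
    parts = []
    with open(DUMP, "rb") as f:
        f.seek(offset)
        while not dec.eof:
            chunk = f.read(1 << 20)
            if not chunk:
                break
            parts.append(dec.decompress(chunk))
    return b"".join(parts).decode("utf-8")

offset = find_offset("Alan Turing")
if offset is not None:
    xml_block = read_stream(offset)  # search this for <title>Alan Turing</title>
```

One caveat: the `pages-articles` dump does not include talk pages, so for those you would likely need the `pages-meta-current` multistream dump and its index instead.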