Download articles from Wikipedia using Special:Export

I want to download the full revision histories of a few thousand articles through http://en.wikipedia.org/wiki/Special:Export, and I am looking for a programmatic way to automate this. The results should be saved as XML.

Here is my Wikipedia query. I started with the following Python script, but it does not return any useful result.

#!/usr/bin/python

import urllib
import codecs

f =  codecs.open('workfile.xml', 'w',"utf-8" )

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
urllib._urlopener = AppURLopener()

query = "http://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"
data = { 'catname':'English-language_Indian_films','addcat':'', 'wpDownload':1 }
data = urllib.urlencode(data)
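# Use a separate name for the response so the output file handle `f` is not overwritten.
response = urllib.urlopen(query, data)
s = response.read()
# The codecs writer expects unicode, so decode the raw bytes before writing.
f.write(s.decode('utf-8'))
f.close()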

There is 1 answer

Snakes and Coffee

I would honestly suggest using Mechanize to get the page, then using lxml or another XML parser to extract the information you want. I usually send the Firefox user agent, since the default user agents of many programs are blocked. Note that with Mechanize you can actually fill out the form and "click" Enter, then "click" Export.
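If you would rather not depend on Mechanize, the same idea can be sketched with the Python 3 standard library alone: POST the form fields yourself, then parse the returned XML. This is only a rough sketch under some assumptions. I am assuming the form accepts the `pages`, `history`, `curonly` and `wpDownload` fields, that the list of titles is collected separately (I believe `addcat` only fills in the title box instead of starting the download), and that the server may cap how many revisions it returns per request. The helper names, `USER_AGENT` and the sample titles are made up for illustration, and `xml.etree.ElementTree` stands in for lxml here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EXPORT_URL = "https://en.wikipedia.org/w/index.php"
# Hypothetical User-Agent string; replace it with one that identifies you.
USER_AGENT = "ExportSketch/0.1 (contact: you@example.org)"


def build_export_request(titles, full_history=True):
    """Build (but do not send) a POST request for Special:Export."""
    fields = {
        "title": "Special:Export",
        "action": "submit",
        # One title per line, mirroring the textarea on the form.
        "pages": "\n".join(titles),
        "wpDownload": "1",
    }
    # Assumed field names: 'history' for all revisions, 'curonly' for the latest.
    if full_history:
        fields["history"] = "1"
    else:
        fields["curonly"] = "1"
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        EXPORT_URL, data=body, headers={"User-Agent": USER_AGENT}
    )


def fetch_export(titles, path, full_history=True):
    """Send the request and save the raw XML response to 'path'."""
    request = build_export_request(titles, full_history)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_revisions(xml_bytes):
    """Map each exported page title to its number of <revision> elements."""
    root = ET.fromstring(xml_bytes)
    counts = {}
    for page in root:
        if local_name(page.tag) != "page":
            continue
        title = None
        revisions = 0
        for child in page:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                revisions += 1
        counts[title] = revisions
    return counts
```

Something like `fetch_export(["Lagaan", "Monsoon Wedding"], "workfile.xml")` would then write the export to disk, and `count_revisions(open("workfile.xml", "rb").read())` gives a quick check of how many revisions actually came back for each page. For a few thousand articles I would send the titles in small batches and pause between requests.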