How to use wiktextract


I am trying to extract data from a Wiktionary XML dump file using the wiktextract Python module, but the project's website does not give me enough information. I could not use the command-line program that comes with it because it isn't a Windows executable, so I tried the programmatic way. The following code takes a while to run, so it seems to be doing something, but I'm not sure what to do with the ctx variable. Can anyone help me?

import wiktextract

def word_cb(data):
    print(data) 

ctx = wiktextract.parse_wiktionary(
    r'myfile.xml', word_cb,
    languages=["English", "Translingual"])

1 Answer

MassPikeMike On Best Solutions

You are on the right track, but you don't have to worry too much about the ctx object. As the documentation says:

The parse_wiktionary call will call word_cb(data) for words and redirects found in the Wiktionary dump. data is information about a single word and part-of-speech as a dictionary (multiple senses of the same part-of-speech are combined into the same dictionary). It may also be a redirect (indicated by presence of a redirect key in the dictionary).
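To make the quoted description concrete, here is a minimal sketch of a callback that separates redirects from ordinary entries. Only the redirect key comes from the documentation quoted above; the other keys in the sample dictionaries and the names make_word_cb, entries, and redirects are assumptions made up for illustration.

```python
def make_word_cb(entries, redirects):
    """Build a callback that sorts incoming dictionaries into two lists."""
    def word_cb(data):
        # Per the quoted documentation, a redirect is marked by a "redirect" key.
        if "redirect" in data:
            redirects.append(data)
        else:
            entries.append(data)
    return word_cb


entries, redirects = [], []
word_cb = make_word_cb(entries, redirects)

# Hypothetical sample dictionaries, only to exercise the callback;
# real ones would be supplied by wiktextract.parse_wiktionary.
word_cb({"word": "cat", "pos": "noun"})
word_cb({"title": "kitty-cat", "redirect": "cat"})
```

Passing a callback like this to parse_wiktionary would collect everything in memory, which is probably only reasonable for a small extract rather than a full dump.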

The returned ctx object mostly contains summary information (the number of sections processed, etc.); you can use dir(ctx) to see some of its fields.

The useful results are not in the returned ctx object; they are the dictionaries passed to word_cb, one word at a time. So you might try something like the following to get a JSON dump from a Wiktionary XML dump. Because the full dumps are many gigabytes, I put a small one on a server for convenience in this example.

import json
import wiktextract

import requests

xml_fn = 'enwiktionary-20190220-pages-articles-sample.xml'

print("Downloading XML dump to " + xml_fn)

response = requests.get('http://45.61.148.79/' + xml_fn, stream=True)

# Throw an error for bad status codes
response.raise_for_status()

with open(xml_fn, 'wb') as handle:
    for block in response.iter_content(4096):
        handle.write(block)

print("Downloaded XML dump, beginning processing...")

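# Binary mode keeps fh.tell() a plain byte count
fh = open("output.json", "wb")
def word_cb(data):
    # json.dumps returns str, so encode it before writing to a binary file
    fh.write(json.dumps(data).encode("utf-8"))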

ctx = wiktextract.parse_wiktionary(
    xml_fn, word_cb,
    languages=["English", "Translingual"])

print("{} English entries processed.".format(ctx.language_counts["English"]))
print("{} bytes written to output.json".format(fh.tell()))

fh.close()

For me this produces:

Downloading XML dump to enwiktionary-20190220-pages-articles-sample.xml
Downloaded XML dump, beginning processing...
684 English entries processed.
326478 bytes written to output.json

That is with the small dump extract mentioned above; running on the full dump will take much longer.
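Because the callback writes each dictionary with json.dumps and no separator, output.json holds JSON objects back to back rather than one valid JSON document, so json.load would reject it as soon as there is more than one entry. Here is a minimal sketch of one way to read the entries back with the standard library's json.JSONDecoder.raw_decode. The function name iter_json_objects and the sample string are made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up sample mimicking two entries written back to back.
sample = '{"word": "cat", "pos": "noun"}{"word": "dog", "pos": "noun"}'
words = [entry["word"] for entry in iter_json_objects(sample)]
```

For output.json, text would be the file contents read in text mode with UTF-8 encoding. For a full dump it may be preferable to have the callback write one object per line, so the file can be read line by line instead of all at once.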