I have trained a doc2vec model on the Wikipedia corpus using gensim and I wish to retrieve vectors from different documents.

I was wondering what text preprocessing the WikiCorpus class applied when I used it to train my model, e.g. removing punctuation, lowercasing all the text, removing stop words, etc.

This is important because I wish to apply the same text processing to the documents I am inferring vectors for, for greater consistency/accuracy with my model.

1 Answer

gojomo (Best Solution)

To know precisely what's done, your best reference is the source code for WikiCorpus itself, which you can view in your local installation, or online at:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/wikicorpus.py

Key functions in that file for dealing with the raw Wikipedia dump data include process_article(), filter_wiki(), and remove_markup(). That processing ultimately goes through a local tokenize() function, which in turn relies on another tokenize() from the gensim.utils module.
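
To get a concrete feel for that chain, here's a minimal sketch that calls those helpers directly on a made-up snippet of markup (the sample string is invented for illustration, and the output shown is approximate):

    from gensim.corpora.wikicorpus import filter_wiki, tokenize

    # Made-up snippet of raw wiki markup, purely for illustration.
    raw = "'''Anarchism''' is a [[political philosophy]] and [[movement]]."

    plain = filter_wiki(raw)   # strips wiki markup such as the [[...]] links
    tokens = tokenize(plain)   # lowercases; keeps alphabetic tokens (length 2-15 by default)

    print(tokens)
    # e.g. ['anarchism', 'is', 'political', 'philosophy', 'and', 'movement']
    # (single-character tokens like 'a' are dropped by the default length filter)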

And, WikiCorpus does in fact call that utils.tokenize() with a lower=True parameter to force lowercasing.
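
A quick illustrative call (not part of WikiCorpus itself) shows the effect of that parameter:

    from gensim import utils

    # lower=True forces lowercasing; punctuation falls away during tokenization.
    print(list(utils.tokenize("Wikipedia, the Free Encyclopedia!", lower=True)))
    # e.g. ['wikipedia', 'the', 'free', 'encyclopedia']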

Further, that utils.tokenize() relies on a simple_tokenize() function that, while it has no step that explicitly removes punctuation, extracts tokens via a PAT_ALPHABETIC regex matching runs of word characters (\w) that contain no digits (\d), so punctuation and numbers simply never make it into tokens.
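
Putting that together for your use case, here's a rough sketch of reusing the same preprocessing before inferring vectors. The model filename and sample sentence below are placeholders, and it assumes the helpers' default settings in a recent gensim:

    from gensim.models.doc2vec import Doc2Vec
    from gensim.corpora.wikicorpus import filter_wiki, tokenize

    # Placeholder path -- replace with your own saved model.
    model = Doc2Vec.load("wiki_doc2vec.model")

    def preprocess(text):
        # Mirror the WikiCorpus pipeline: strip any wiki markup, then
        # lowercase and tokenize with the same helper the corpus used.
        return tokenize(filter_wiki(text))

    tokens = preprocess("The 2024 Summer Olympics were held in Paris, France.")
    # e.g. ['the', 'summer', 'olympics', 'were', 'held', 'in', 'paris', 'france']
    vector = model.infer_vector(tokens)

Since infer_vector() expects a list of tokens in the same form as the training data, matching this preprocessing is exactly what gives you the consistency you're after.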