I have trained a doc2vec model on the Wikipedia corpus using gensim and I wish to retrieve vectors from different documents.

I was wondering what text preprocessing the WikiCorpus class applied when I used it to train my model, e.g. removing punctuation, lowercasing all the text, removing stop words, etc.

This is important as I wish to perform the same text processing on the documents I am inferring vectors from for greater consistency/accuracy with my model.

1 Answer

gojomo (Best Solution)

To know precisely what's done, your best reference is the source code for WikiCorpus itself, which you can view in your local installation (the gensim/corpora/wikicorpus.py module) or in the gensim source repository on GitHub.

Key functions in that file for dealing with the raw Wikipedia dump data include process_article(), filter_wiki() and remove_markup(); the processing ultimately hands off to a local tokenize() function, which in turn relies on another tokenize() in the gensim.utils module.
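
If you want to apply the same cleanup to your own text, you can call those functions directly. Here's a minimal sketch, assuming a gensim version where filter_wiki() and tokenize() are importable from gensim.corpora.wikicorpus; the exact defaults (such as the minimum token length) have shifted a bit between versions, so check your installed copy:

```python
from gensim.corpora.wikicorpus import filter_wiki, tokenize

# Raw text as it might appear in a Wikipedia dump, with wiki markup.
raw = "'''Machine learning''' is a [[field]] of [[artificial intelligence]]."

# filter_wiki() strips the wiki markup; tokenize() then applies the same
# token extraction (and, by default, lowercasing) used while building the corpus.
plain = filter_wiki(raw)
tokens = tokenize(plain)

print(tokens)
# e.g. ['machine', 'learning', 'is', 'field', 'of', 'artificial', 'intelligence']
# note that common stop-words like 'is'/'of' are NOT removed, though very short
# tokens (here the single-character 'a') are dropped by a length filter
```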

And, WikiCorpus does in fact call that utils.tokenize() with a lower=True parameter to force lowercasing.
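
As a quick illustration of that lowercasing (a sketch against gensim.utils; note the function returns a generator):

```python
from gensim import utils

# lower=True folds case the same way WikiCorpus does during training;
# utils.tokenize() returns a generator, so wrap it in list() to inspect it.
print(list(utils.tokenize("The Quick, Brown FOX!", lower=True)))
# ['the', 'quick', 'brown', 'fox']
```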

Further, that utils.tokenize() uses a simple_tokenize() function which has no explicit punctuation-removal step; instead it extracts tokens with a PAT_ALPHABETIC regex that matches runs of word characters (\w) containing no digits (\d), so punctuation and purely numeric strings never survive as tokens.
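
To get back to your actual goal: reuse that same tokenization on each new document before inference, so the tokens match what the model saw during training. A rough sketch, where the path "doc2vec_wiki.model" is just a placeholder for wherever you saved your trained model:

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.corpora.wikicorpus import tokenize

# Placeholder path; point this at your own saved Doc2Vec model.
model = Doc2Vec.load("doc2vec_wiki.model")

new_doc = "Deep learning methods have transformed natural language processing."

# Same lowercasing and token extraction that WikiCorpus used at training time,
# so the inferred vector is comparable to the vectors learned from Wikipedia.
tokens = tokenize(new_doc)
vector = model.infer_vector(tokens)
print(vector[:5])
```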