How to get doc2vec to reliably work with UMAP?

I have the following pyspark code where I build a Doc2Vec model and run UMAP on it. Occasionally, the last UMAP line throws the error "Cannot assign slice from input of different size".

I could search for a random starting seed that happens to converge for this specific document input, but I'd rather improve the model code so that it handles any similar document with different data and always converges, without my having to find a working seed by hand.

What is it about the doc2vec model that makes it sometimes not work with the UMAP function that I can improve?

import gensim
import umap

# Build one TaggedDocument per corpus entry, dropping None tokens
train_corpus = [
    gensim.models.doc2vec.TaggedDocument(
        [word for word in agg_corpus_dict[i]['doc'] if word is not None],
        [str(agg_corpus_dict[i]['id'])],
    )
    for i in range(len(agg_corpus_dict))
]

model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=20, epochs=20, workers=4)

progress_per_value = 1000
model.build_vocab(train_corpus, progress_per=progress_per_value)

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

model.make_cum_table()
model.save('<model_dir>')

# Load fitted doc2vec model
doc2vec_model = gensim.models.doc2vec.Doc2Vec.load('<model_dir>')
# Fit UMAP on the document vectors (this is the line that sometimes fails)
reducer = umap.UMAP().fit(doc2vec_model.dv.vectors)

I've tried random starting seeds.

There is 1 answer

Answer by gojomo:

Editing your question to show the full error you receive, including all the lines of traceback showing involved lines-of-code/files, will help answerers determine what's going on.

If the exact same code, using the exact same Doc2Vec model, sometimes succeeds and sometimes fails, that implies some instability in the UMAP code. (Still, seeing the whole error and traceback might offer clues.)
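One way to test this is to hold the saved model fixed and refit several times, varying only the seed. The sketch below is a minimal, hypothetical helper (fit_with_seeds is not part of gensim or umap-learn); it takes the fitting step as a callable, so it assumes nothing about UMAP beyond what you pass in.

```python
import traceback


def fit_with_seeds(vectors, fit_fn, seeds, verbose=False):
    """Call fit_fn(vectors, seed) once per seed on the SAME input.

    Returns a dict mapping each seed to None on success, or to a short
    "ExceptionType: message" string on failure.
    """
    outcomes = {}
    for seed in seeds:
        try:
            fit_fn(vectors, seed)
            outcomes[seed] = None
        except Exception as exc:  # broad on purpose: we only record the failure
            if verbose:
                traceback.print_exc()  # full traceback, useful to share in the question
            outcomes[seed] = f"{type(exc).__name__}: {exc}"
    return outcomes
```

With the question's variables, a call might look like fit_with_seeds(doc2vec_model.dv.vectors, lambda X, s: umap.UMAP(random_state=s).fit(X), range(10)). If the outcomes differ by seed on identical vectors, the variability is coming from the UMAP step rather than from Doc2Vec training.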

If, on the other hand, it fails reliably on some frozen Doc2Vec models but not others, you should add extra output to determine what's different between the cases that succeed and those that fail. For example, print(doc2vec_model.dv.vectors.shape) before the line that sometimes fails, and examine (or share in your question) the outputs from both successful and failing runs.
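A slightly fuller version of that check could be logged before every UMAP call. This is only a sketch: summarize_vectors is a hypothetical helper name, and it assumes the vectors are a numeric NumPy array, as doc2vec_model.dv.vectors is.

```python
import numpy as np


def summarize_vectors(vectors):
    """Collect basic facts about a vector array for comparison across runs."""
    arr = np.asarray(vectors)
    return {
        "shape": arr.shape,        # (number of documents, vector_size)
        "dtype": str(arr.dtype),
        # Count of NaN or +/-inf entries; any nonzero value deserves a closer look
        "n_nonfinite": int((~np.isfinite(arr)).sum()),
    }
```

Printing summarize_vectors(doc2vec_model.dv.vectors) just before the fit gives one line per run that can be compared between successful and failing cases.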

If that shows no obvious difference/coding-error between working and non-working cases, I suppose there's a chance the UMAP code is sensitive in some way to the exact values inside the Doc2Vec vectors. I wouldn't normally expect that – in the usual case, all vectors have nonzero dimensions, and I'd expect an algorithm that works on one set of such dimensions to work on others.

But I suppose it might be possible, especially if you're running on a small or quirky amount of data, that some runs are leaving some vector dimensions in weird states – like lots of 0.0 values – and that's perhaps creating problems for the UMAP step, if it assumes otherwise. So if nothing else improves things, that'd be something else to check.
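To check for that kind of degenerate state, something like the following could be run on the vectors from a failing model. It is a hedged sketch, not a known fix: zero_report and its atol parameter are hypothetical names, and whether UMAP is actually sensitive to such values is unconfirmed.

```python
import numpy as np


def zero_report(vectors, atol=0.0):
    """Report how many entries, rows, and columns are (near) zero."""
    arr = np.asarray(vectors, dtype=float)
    is_zero = np.abs(arr) <= atol  # atol=0.0 means exactly zero
    return {
        "zero_fraction": float(is_zero.mean()),           # share of all entries that are zero
        "all_zero_rows": int(is_zero.all(axis=1).sum()),  # documents whose whole vector is zero
        "all_zero_columns": int(is_zero.all(axis=0).sum()),  # dimensions that are zero everywhere
    }
```

For example, zero_report(doc2vec_model.dv.vectors) on a working model and on a failing one would show whether the failing case carries noticeably more zeros.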