I have a dataset from the Amazon reviews collection: meta_Electronics.json.gz. The code below was given by my instructor:
def read_product_description(fname):
    '''
    Load all product descriptions
    Args:
        fname: dataset file path
    Returns:
        dict: key is asin, value is description content
    '''
    result = {}
    for i in parse(fname):
        try:
            if "Camera & Photo" in i["categories"][0]:
                result[i["asin"]] = i["description"]
        except:
            continue
    return result
I think the above code keeps only the descriptions of products in the Camera & Photo category.
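(For context, parse is not shown above; the following is only a minimal sketch of what such a helper presumably does, assuming the gzipped one-record-per-line format of the Amazon dataset files. The instructor's actual version may differ.)

import gzip
import json
import ast

def parse(path):
    # Assumed helper: read a gzipped file with one record per line.
    # Depending on the file version, lines are JSON or Python dict literals,
    # so fall back to ast.literal_eval if json.loads fails.
    with gzip.open(path, 'rt') as g:
        for line in g:
            try:
                yield json.loads(line)
            except ValueError:
                yield ast.literal_eval(line)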
class TaggedDescriptionDocument(object):
    '''
    This class stores all product description information in its dictionary and
    generates an iterator of TaggedDocument objects, which can be used for a
    Doc2Vec model.
    '''
    def __init__(self, descriptondict):
        self.descriptondict = descriptondict

    def __iter__(self):
        for asin in self.descriptondict:
            for content in self.descriptondict[asin]:
                yield TaggedDocument(clean_line(content), [asin])
Note: clean_line just cleans every single line in the content (removes punctuation, etc.).
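(clean_line is not shown either; a minimal sketch of a cleaner along those lines, using gensim's simple_preprocess, is below. The real implementation may differ.)

from gensim.utils import simple_preprocess

def clean_line(line):
    # Assumed cleaner: lowercase, strip punctuation, and split into tokens.
    # TaggedDocument expects a list of word tokens, not a raw string.
    return simple_preprocess(line)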
description_dict = read_product_description("meta_Electronics.json.gz")
des_documents = TaggedDescriptionDocument(description_dict)
After the above two steps, I think it creates TaggedDocuments to feed the Doc2Vec model. However, when I tried to train a Doc2Vec model, it shows:
model_d = Doc2Vec(des_documents, vector_size=100, window=15, min_count=0, max_vocab_size=1000)
RuntimeError: you must first build vocabulary before training the model
The min_count is already 0. Is there anything wrong with the code? Any help will be appreciated!
The "you must first build vocabulary" error suggests that something, such as a buggy corpus, prevented any vocabulary from being discovered. Are you sure des_documents contains what you intended it to? For example:

- If you run sum(1 for _ in des_documents) repeatedly, does it report the same count of documents you expect, every time?
- Does next(iter(des_documents)) show a valid TaggedDocument object with sensible words and tags?

You should also try enabling logging at the INFO level, and try all steps again, watching the logged output carefully for any hints that something is going wrong. (Do steps take a reasonable amount of time, and report counts of discovered/surviving words that make sense?)
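(A quick sanity check along those lines might look like this; a sketch, assuming des_documents is built as shown above.)

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Does the corpus yield the expected number of documents, every time?
print(sum(1 for _ in des_documents))
print(sum(1 for _ in des_documents))  # should match the first count

# Does the first item look like a proper TaggedDocument?
first = next(iter(des_documents))
print(type(first))   # should be TaggedDocument
print(first.words)   # should be a list of word tokens
print(first.tags)    # should be [asin]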
max_vocab_size=1000 is almost certainly an unhelpful setting. It doesn't cap the final surviving vocabulary; it causes the initial vocabulary scan to never remember more than 1000 words. And further, to ruthlessly enforce that cap in a crude but low-overhead way, every time it hits the cap it discards all words with fewer occurrences than an ever-escalating floor.

This setting was only intended as a crude way to prevent vocabulary discovery from exhausting all RAM, and if used at all, it should be set to some value far, far larger than whatever vocabulary size you desire or expect. So your atypically tiny value of 1000, together with any amount of data sufficient for an algorithm like Doc2Vec (lots and lots of varied words), could be contributing to your problem. With any dataset you've already got loaded in memory, it's unlikely to be a needed setting at all.
Separately, min_count=0 is almost always a bad setting for these algorithms, which only effectively model words with many contrasting usage examples. Throwing out words that only appear a few times usually improves the overall quality of the surviving learned vectors – hence the default min_count=5.
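(Putting those two suggestions together, a more conventional call might look something like the sketch below; the values are illustrative, not tuned for your specific data.)

from gensim.models.doc2vec import Doc2Vec

# Drop max_vocab_size entirely (the corpus already fits in memory) and let
# min_count default to 5 so very rare words don't dilute the model.
model_d = Doc2Vec(des_documents, vector_size=100, window=15, min_count=5)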