I am trying to solve a multi-class, single-label document classification problem, i.e. assigning exactly one class to each document. The documents are domain-specific technical documents containing many technical terms:
- Train: I have 19 classes with a single document in each class.
- Target: I have 77 unlabelled documents that I want to classify into the 19 known classes.
- Documents have between 60 and 3,000 tokens after pre-processing.
- My entire corpus (19 + 77 documents) has 65k terms (uni-/bi-/tri-grams), of which 4.5k terms are shared between train and target.
Currently, I vectorize the documents with a tf-idf vectorizer, restrict the dimensions to the shared terms, compute the cosine similarity between train and target, and assign each target document the class of its most similar training document (sketched below).
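For concreteness, here is a minimal sketch of that pipeline. The tiny `train_docs`/`target_docs`/`train_labels` lists are placeholders for the real 19 training and 77 target documents:

```python
# Minimal sketch of the current pipeline. The tiny lists below are
# placeholders for the real 19 training and 77 target documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["turbine blade cooling design", "pump seal maintenance manual"]
train_labels = ["turbines", "pumps"]  # one label per training document
target_docs = ["notes on turbine blade cooling", "seal wear in pumps"]

# Find the vocabulary shared between train and target by fitting a
# vectorizer on each side separately.
vec_train = TfidfVectorizer(ngram_range=(1, 3)).fit(train_docs)
vec_target = TfidfVectorizer(ngram_range=(1, 3)).fit(target_docs)
common_terms = sorted(set(vec_train.vocabulary_) & set(vec_target.vocabulary_))

# Re-vectorize both sets on the shared terms only, then give each target
# document the label of its most similar training document.
vec = TfidfVectorizer(ngram_range=(1, 3), vocabulary=common_terms)
X_train = vec.fit_transform(train_docs)
X_target = vec.transform(target_docs)

sims = cosine_similarity(X_target, X_train)  # shape (n_target, n_train)
predictions = [train_labels[i] for i in sims.argmax(axis=1)]
print(predictions)
```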
I am wondering if there is a better way. I cannot use standard sklearn classifiers because I have only a single document per class in train. Any ideas for a possible improvement or direction? Especially:
- Does it make sense to use word embeddings/doc2vec given the small corpus? (A rough sketch of the embedding variant follows this list.)
- Does it make sense to generate synthetic train data from the terms in the training set? (Also sketched after this list.)
- Any other ideas?
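For the first bullet, this is roughly what the embedding variant would look like, reusing the placeholder variables from the sketch above and assuming spaCy with its `en_core_web_md` model (which ships pretrained word vectors) is installed; domain-specific terms missing from the pretrained vocabulary simply contribute nothing to the average:

```python
# Hedged sketch of the embedding idea: average pretrained word vectors per
# document, then compare documents with cosine similarity. Assumes spaCy
# and en_core_web_md are installed; reuses the placeholder variables
# train_docs / train_labels / target_docs from the sketch above.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # model with pretrained word vectors

def doc_vector(text):
    # spaCy's doc.vector is the average of the token vectors; tokens
    # without a pretrained vector (domain-specific terms) are zeros.
    return nlp(text).vector

train_vecs = np.vstack([doc_vector(d) for d in train_docs])
target_vecs = np.vstack([doc_vector(d) for d in target_docs])

# Cosine similarity via normalized dot products.
train_norm = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
target_norm = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
sims = target_norm @ train_norm.T
predictions = [train_labels[i] for i in sims.argmax(axis=1)]
```

And for the second bullet, one naive way to generate synthetic train data from the terms in the training set would be to bootstrap pseudo-documents by resampling tokens, so that an ordinary classifier sees more than one example per class; `bootstrap_docs` is a hypothetical helper, shown only to make the idea concrete:

```python
# Hedged sketch of the synthetic-data idea: bootstrap pseudo-documents by
# sampling tokens with replacement from each training document.
# bootstrap_docs is a hypothetical helper, not an established technique.
import random

def bootstrap_docs(doc, n_samples=10, frac=0.8, seed=0):
    """Create pseudo-documents by resampling a fraction of the tokens."""
    rng = random.Random(seed)
    tokens = doc.split()
    size = max(1, int(frac * len(tokens)))
    return [" ".join(rng.choices(tokens, k=size)) for _ in range(n_samples)]

aug_docs, aug_labels = [], []
for doc, label in zip(train_docs, train_labels):
    for pseudo in bootstrap_docs(doc):
        aug_docs.append(pseudo)
        aug_labels.append(label)
# aug_docs / aug_labels could now feed an ordinary sklearn classifier,
# though the answer below explains why this is unlikely to generalize.
```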
Thanks in advance!
Good to see that you've considered the usual strategies - generating synthetic data, pretrained word embeddings - for a semi-supervised text classification scenario. Unfortunately, since you have only one training example per class, no matter how good your feature extraction or how effective your data generation is, the classifier you train will almost certainly not generalize. You need more (real) labelled data.