I'm working on text classification and I have a set of 200,000 tweets.

The idea is to manually label a short set of tweets and train classifiers to predict the labels of the rest. Supervised learning.

What I would like to know is whether there is a method for choosing which samples to include in the training set so that the training set is a good representation of the whole data set, and so that, because of the high diversity captured in the training set, the trained classifiers can be trusted when applied to the rest of the tweets.

2 Answers

azeldes On

This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?

If it's the first scenario, I think the steps in order of importance would be:

  1. Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
  2. Stratify by features you plan to use
  3. Stratify by tweet length/vocabulary etc.
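A minimal sketch of step 1, using scikit-learn's `train_test_split` with its `stratify` parameter; the tweets and labels here are invented placeholders with the 50-30-20% class proportions mentioned above:

```python
# Hypothetical example: stratified train/dev/test split that preserves
# class proportions. Data is synthetic, just to illustrate the mechanics.
from sklearn.model_selection import train_test_split

tweets = ["tweet %d" % i for i in range(100)]
labels = [0] * 50 + [1] * 30 + [2] * 20  # 50-30-20% class proportions

# First carve off a 20% test set, stratifying by label...
X_rest, X_test, y_rest, y_test = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=0)
# ...then split the remainder into train and dev, again stratified.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Each split keeps (roughly) the original 50-30-20 class ratio.
print(len(X_train), len(X_dev), len(X_test))
```

The same `stratify` argument also covers steps 2 and 3 if you discretize the feature (e.g. bucket tweet lengths) and pass that array instead of, or combined with, the class labels.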

If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as features, coupled with a dimensionality reduction or clustering approach. For example:

  1. Use something like PCA or t-SNE to project tweets (or a large subset) into a low-dimensional space, then pick candidates from different regions of the projected space
  2. Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
  3. Use something like LDA to find n topics, then sample stratified by topic

Hope this helps!

David Dale On

It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.

After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets on which your classifier is least confident. Label them, retrain the classifier, and repeat.
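The loop above can be sketched as follows; this is a toy illustration with a synthetic pool, where the hypothetical `oracle_label` function stands in for the manual annotation step:

```python
# Hypothetical sketch of uncertainty-based active learning:
# seed with a uniform random sample, then repeatedly label the
# examples the current classifier is least confident about.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pool = ["good %d" % i for i in range(50)] + ["bad %d" % i for i in range(50)]
oracle_label = lambda t: int(t.startswith("bad"))  # stands in for a human annotator

vec = TfidfVectorizer().fit(pool)
rng = np.random.default_rng(0)

# Seed: a small uniform random sample, manually labelled
labels = {int(i): oracle_label(pool[i])
          for i in rng.choice(len(pool), size=10, replace=False)}

for _ in range(3):  # a few active-learning iterations
    clf = LogisticRegression().fit(
        vec.transform([pool[i] for i in labels]), list(labels.values()))
    unlabelled = [i for i in range(len(pool)) if i not in labels]
    probs = clf.predict_proba(vec.transform([pool[i] for i in unlabelled]))
    confidence = probs.max(axis=1)  # least confident = smallest max probability
    for j in np.argsort(confidence)[:10]:   # label the 10 most uncertain tweets
        labels[unlabelled[j]] = oracle_label(pool[unlabelled[j]])

print(len(labels))  # 10 seed + 3 rounds of 10 = 40 labelled texts
```

In practice each inner step is a manual annotation session rather than a function call, but the structure (train, score the pool, pick the low-confidence tail, repeat) is the same.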

Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.