I have a conceptual question regarding text classification. I have a corpus of English-language documents that I want to classify based on their content. I am building a classifier and am not sure yet which method I will use: possibly SVMs, naive Bayes, or neural networks. I will have a training set of documents and, of course, a test set.
Here's my question: The corpus of documents will be added to over time, so it is possible that the classifier constructed now will, over time as the corpus changes, become less accurate. How do I keep the classifier current and accurate? Do I implement regular re-training? Is there a method of continuous training as the corpus changes? How is this circumstance handled?
You have two possible solutions:
(The easiest) If you cannot guarantee that your training set stays representative of the corpus, retrain the classifier at regular intervals, for example each time you have collected enough new labeled examples.
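You can consider active or incremental (online) learning. Incremental learning updates the existing model with new labeled documents instead of retraining from scratch. Active learning goes further and asks a user to label the documents the model is least sure about, and that interaction with the end user is not always desirable.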
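To make the incremental option concrete, here is a minimal sketch in plain Python, assuming a multinomial naive Bayes model (one of the methods mentioned in the question) and a naive whitespace tokenizer. The class name `IncrementalNaiveBayes` and its `partial_fit`/`predict` methods are made up for illustration and are not taken from any library. Naive Bayes is convenient here because the model is just a set of counts, so adding new labeled documents only means incrementing those counts.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Hypothetical multinomial naive Bayes whose counts are updated batch by batch."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing strength
        self.class_docs = Counter()              # number of documents seen per class
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.vocab = set()                       # all words seen so far

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase + whitespace split; a real system needs more.
        return text.lower().split()

    def partial_fit(self, texts, labels):
        """Fold a new batch of labeled documents into the existing counts."""
        for text, label in zip(texts, labels):
            tokens = self.tokenize(text)
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        """Return the class with the highest smoothed log-posterior."""
        tokens = self.tokenize(text)
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Initial training on the documents available today (toy data).
clf = IncrementalNaiveBayes()
clf.partial_fit(["cheap pills buy now", "meeting agenda for monday"],
                ["spam", "ham"])

# Later, as the corpus grows, update the same model with the new batch.
clf.partial_fit(["quarterly budget meeting notes", "win a free prize now"],
                ["ham", "spam"])
print(clf.predict("free prize now"))
```

The same loop also covers the periodic-retraining option: instead of calling `partial_fit` on the existing model, you would build a fresh model from the full, updated training set whenever enough new labeled documents have accumulated. Either way, keep evaluating on a recent held-out sample so you notice when accuracy starts to drop.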