Can tfidf be weighted to improve classification of sparse data in a corpus?

I am currently using tfidf as a preprocessing step before classifying a number of websites by their content. Unfortunately, my training data is imbalanced: about 70% of the pre-labeled websites are news sites, while each of the remaining classes (tech, arts, entertainment, etc.) makes up only a small minority.

My questions are the following:

  1. Is it possible to adjust tfidf so that it weights different labels differently and behaves as if the data were uniform? Or should I be using a different approach in this case? I am currently using a Gaussian Naive Bayes classifier after the tfidf step; would something else be better suited to this specific case?

  2. Is it possible to have the tfidf-based classifier give me a list of possible labels when the probability of the most likely label is below a certain threshold? For example, if the class probabilities are close enough that a site is only slightly (< 1-2%) more likely to belong to one class than another, can it print both?
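
To make both questions concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. Everything in it is an assumption for illustration: the function names `balanced_class_weights` and `candidate_labels`, the label counts, the example probabilities, and the 2% margin. The weighting is the common inverse-frequency heuristic, n_samples / (n_classes * class_count), and it does not change tfidf itself; it only produces per-class weights that a classifier would have to accept.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def candidate_labels(probabilities, margin=0.02):
    """Return every label whose probability is within `margin` of the best one.

    Labels are ordered from most to least probable (ties broken alphabetically).
    """
    best = max(probabilities.values())
    close = [(p, label) for label, p in probabilities.items() if best - p <= margin]
    return [label for p, label in sorted(close, key=lambda t: (-t[0], t[1]))]


# Hypothetical label distribution resembling the one described above: 70% news.
labels = ["news"] * 70 + ["tech"] * 10 + ["arts"] * 10 + ["entertainment"] * 10
weights = balanced_class_weights(labels)

# Hypothetical per-class probabilities, as a classifier might report for one site.
probs = {"news": 0.41, "tech": 0.40, "arts": 0.12, "entertainment": 0.07}

print(weights)                  # the minority classes get the larger weights
print(candidate_labels(probs))  # every label within 2% of the top probability
```

This assumes the classifier used after tfidf can take per-sample or per-class weights during training and can report per-class probabilities at prediction time; whether that is the right approach for Gaussian Naive Bayes is part of what I am asking.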

There are 0 answers