In natural language processing (NLP), how do you make an efficient dimension reduction?


In NLP, it's almost always the case that the dimension of the features is very high. For example, in one project at hand, the number of features is almost 20 thousand (p = 20,000), and each feature is a 0-1 indicator of whether a specific word or bi-gram is present in a paper (one paper is a data point $x \in R^{p}$).

I know the redundancy among the features is huge, so dimension reduction is necessary. I have three questions:

1) I have 10 thousand data points (n = 10,000), and each data point has 10 thousand features (p = 10,000). What is an efficient way to conduct dimension reduction? The matrix $X \in R^{n \times p}$ is so huge that both PCA (or SVD; truncated SVD is OK, but I don't think SVD is a good way to reduce dimension for binary features) and Bag of Words (or K-means) are hard to run directly on $X$ (sure, it is sparse). I don't have a server, I just use my PC :-(
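For what it's worth, scikit-learn's `TruncatedSVD` does accept scipy sparse input directly, so the matrix is never densified. A minimal sketch under that assumption, with a simulated 0-1 matrix standing in for the real data:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Simulated sparse 0-1 feature matrix standing in for the real data
# (smaller than n = p = 10,000 to keep the sketch fast).
X = sparse_random(1000, 10000, density=0.01, format="csr",
                  data_rvs=lambda k: np.ones(k), random_state=0)

# TruncatedSVD works on scipy sparse matrices without densifying them.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (1000, 100)
```

Whether 100 components is a sensible target depends on the task; it is just an illustrative value here.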

2) How should one judge the similarity or distance between two data points? I think the Euclidean distance may not work well for binary features. How about the L0 norm? What do you use?

3) If I want to use an SVM (or another kernel method) to conduct classification, which kernel should I use?

Many Thanks!

1 Answer

Andreas Mueller:

1) You don't need dimensionality reduction. If you really want, you can use an L1 penalized linear classifier to reduce to the most helpful features.
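A minimal sketch of this L1-based feature selection, assuming scikit-learn; the toy 0-1 matrix and random labels are placeholders for the real data, and `SelectFromModel` keeps only the features with nonzero coefficients:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# Toy 0-1 feature matrix and labels standing in for the real data.
rng = np.random.default_rng(0)
X = (rng.random((200, 500)) < 0.05).astype(float)
y = rng.integers(0, 2, size=200)

# The L1 penalty drives most coefficients to exactly zero
# (penalty="l1" requires dual=False for LinearSVC).
clf = LinearSVC(C=0.5, penalty="l1", dual=False, max_iter=5000).fit(X, y)
X_small = SelectFromModel(clf, prefit=True).transform(X)
print(X_small.shape[1], "features kept out of", X.shape[1])
```

Lowering `C` strengthens the penalty and prunes more features.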

2) Cosine similarity is often used, or cosine similarity of the TFIDF rescaled vectors.
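A minimal sketch of cosine similarity on TF-IDF rescaled vectors, assuming scikit-learn; the toy documents stand in for the papers in the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["sparse binary features",
        "binary features in text",
        "kernel methods for classification"]

# Sparse TF-IDF matrix; cosine_similarity accepts sparse input directly.
X = TfidfVectorizer().fit_transform(docs)
S = cosine_similarity(X)  # pairwise similarities, diagonal is 1
print(S.round(2))
```

Documents sharing terms ("binary features") score higher than unrelated ones, and term frequencies play no role beyond the TF-IDF weighting, which suits near-binary features.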

3) Linear SVMs work best with so many features.
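A minimal sketch of a linear-SVM text pipeline in the spirit of the tutorial linked below, assuming scikit-learn; the documents and labels are made up, and `binary=True` reproduces the 0-1 presence features from the question:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy training corpus; labels and texts are purely illustrative.
docs = ["good paper on svm", "bad formatting",
        "great svm results", "poorly written"]
labels = [1, 0, 1, 0]

# binary=True yields 0-1 word/bi-gram presence features.
clf = make_pipeline(CountVectorizer(binary=True, ngram_range=(1, 2)),
                    LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["svm paper"]))
```

`LinearSVC` scales well to tens of thousands of sparse features, which is why a linear kernel is the usual choice here.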

There is a good tutorial on how to do classification like this in python here: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html