Why is it OK to remove variables with low variance from a dataset?

It is a common practice in data analysis to remove features (independent variables) with low variance for dimensionality reduction, with the justification that a feature with low variance cannot explain much of the variance in the response variable (dependent variable).

However, I don't fully understand this reasoning. Here is a counterexample (in R syntax):

 > independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05)
 > dependent_variable   <- c(1, 2, 3, 4, 5, 6)
 > cor(independent_variable, dependent_variable)
 [1] 1             # Pearson's correlation = 1
 > var(independent_variable)
 [1] 0.00035       # low variance of independent variable compared to dependent variable
 > var(dependent_variable)
 [1] 3.5
 > var(independent_variable / mean(independent_variable))
 [1] 3.499998e-14  # very low variance after scaling to mean = 1
 > var(dependent_variable / mean(dependent_variable))
 [1] 0.2857143     # variance of the dependent variable after scaling to mean = 1
 

What I am trying to demonstrate with this example is a case where the dependent and independent variables have a correlation of 1, i.e. the independent variable explains 100% of the variance of the dependent variable. Yet both in the original data and after scaling each variable to mean 1, the variance of the independent variable is far lower than that of the other variable (here, the dependent variable), so by this reasoning it would have been removed.
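
To make the point more concrete, here is a minimal sketch using the same data. The shift by the minimum and the factor 1e6 are arbitrary choices of mine, standing in for a hypothetical change of units, and `x_rescaled` is just an illustrative name. As far as I can tell, the variance of the feature depends entirely on the units it is expressed in, while its correlation with the response does not:

```r
# Same data as in the example above
independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05)
dependent_variable   <- c(1, 2, 3, 4, 5, 6)

# Hypothetical change of units (assumption for illustration only):
# subtract the minimum and multiply by an arbitrary factor
x_rescaled <- (independent_variable - min(independent_variable)) * 1e6

# Variance before and after the change of units
var_original <- var(independent_variable)
var_rescaled <- var(x_rescaled)

# Pearson's correlation with the response before and after
cor_original <- cor(independent_variable, dependent_variable)
cor_rescaled <- cor(x_rescaled, dependent_variable)

print(c(var_original = var_original, var_rescaled = var_rescaled))
print(c(cor_original = cor_original, cor_rescaled = cor_rescaled))
```

So the same feature would be dropped or kept by a variance threshold depending only on how it happens to be scaled, even though its relationship with the response is unchanged.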

What am I missing here?
