Why is it OK to remove variables with low variance from a dataset?

631 views Asked by Amnon At 03 February 2021 at 18:36

It is a common practice in data analysis to remove features (independent variables) with low variance for dimensionality reduction, with the justification that a feature with low variance cannot explain much of the variance in the response variable (dependent variable).
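For concreteness, here is a minimal sketch of the kind of variance filter I mean. The data frame `df`, its column names, and the cutoff `variance_threshold` are made up for illustration; real pipelines may choose the cutoff differently or standardize first.

```r
# Made-up data frame, purely for illustration
df <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05),
  x3 = c(10, 10, 10, 10, 10, 10)
)

# Hypothetical cutoff; not taken from any particular library or paper
variance_threshold <- 0.01

# Sample variance of every column
feature_variances <- sapply(df, var)

# Keep only the columns whose variance exceeds the cutoff
kept <- df[, feature_variances > variance_threshold, drop = FALSE]

print(feature_variances)
print(names(kept))
```

With this cutoff only `x1` survives: the constant column `x3` is dropped, but so is `x2`, whose variance is small only because of the units it is expressed in.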

However, I don't fully understand this reasoning. Here is a counterexample (in R syntax):

 > independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05)
 > dependent_variable <- c(1, 2, 3, 4, 5, 6)
 > cor(independent_variable, dependent_variable)
 [1] 1             # Pearson's correlation = 1
 > var(independent_variable)
 [1] 0.00035
 > var(dependent_variable)
 [1] 3.5           # variance of the independent variable is low compared to the dependent variable
 > var(independent_variable / mean(independent_variable))
 [1] 3.499998e-14  # very low variance after scaling to mean = 1
 > var(dependent_variable / mean(dependent_variable))
 [1] 0.2857143     # variance of the dependent variable after scaling to mean = 1

What I am trying to demonstrate with this example is a case where the dependent and independent variables have a correlation of 1, i.e. the independent variable explains 100% of the variance of the dependent variable. Yet both in the original data and after scaling each variable to mean 1, the variance of the independent variable is much lower than that of the other variables (in this case, the dependent variable), so according to this reasoning it would have been removed.
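To make the point concrete, here is a minimal sketch using the same data. It shows that the variance of a variable changes under a linear rescaling while its correlation with the response does not. The rescaling itself (subtract 100000, multiply by 100) and the name `x_rescaled` are arbitrary choices made for illustration.

```r
# Same data as in the example above
independent_variable <- c(100000, 100000.01, 100000.02, 100000.03, 100000.04, 100000.05)
dependent_variable   <- c(1, 2, 3, 4, 5, 6)

# Hypothetical change of units: shift by 100000 and multiply by 100
x_rescaled <- (independent_variable - 100000) * 100

# Variance depends on the units the variable is expressed in ...
var_original <- var(independent_variable)
var_rescaled <- var(x_rescaled)

# ... but Pearson's correlation is unchanged by a positive linear rescaling
cor_original <- cor(independent_variable, dependent_variable)
cor_rescaled <- cor(x_rescaled, dependent_variable)

# Standardizing (z-scoring) gives any non-constant variable variance 1
var_standardized <- var(as.numeric(scale(independent_variable)))

print(c(var_original = var_original, var_rescaled = var_rescaled))
print(c(cor_original = cor_original, cor_rescaled = cor_rescaled))
print(var_standardized)
```

So whether this variable counts as "low variance" seems to depend entirely on the units it is measured in, while its relationship with the response stays the same.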

What am I missing here?
