Multiple linear regression: handling NA coefficients


I am new to statistics, so simple suggestions would be appreciated.

I have a data frame in R called Ganeeshan:

  Year  General  OBC     SC    ST    VI   VacancySC VacancyGen VacancyOBC Banks Participated  VacancyST VacancyHI
1 2016    52.5  52.5  41.75  31.50  37.5      1338       4500       2319                 20       665       154
2 2015    76.0  76.0  50.00  47.75  36.0      1965       6146       3454                 23      1050       270
3 2014    82.0  80.0  70.00  56.00  38.0      2496       8212       4482                 23      1531       458
4 2013    61.0  60.0  50.00  26.00  27.0      3208      10846       5799                 21      1827       458
5 2012   135.0 135.0 127.00 106.00 127.0      3409      11058       6062                 21      1886       436

   VacancyOC VacancyVI
1       113       102
2       358       242
3       323       321
4       208       390
5       257       345

and want to build a linear model with "General" as the dependent variable. I used the following command:

GaneeshanModel1 <- lm(General ~ ., data = Ganeeshan)

I get NA instead of values in the summary of the model:

Call:

lm(formula = General ~ ., data = Ganeeshan)

Residuals: ALL 5 residuals are 0: no residual degrees of freedom!

Coefficients: (9 not defined because of singularities)

                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          6566.6562         NA      NA       NA
Year                   -3.2497         NA      NA       NA
OBC                     0.5175         NA      NA       NA
SC                     -0.2167         NA      NA       NA
ST                      0.6078         NA      NA       NA
VI                          NA         NA      NA       NA
VacancySC                   NA         NA      NA       NA
VacancyGen                  NA         NA      NA       NA
VacancyOBC                  NA         NA      NA       NA
`Banks Participated`        NA         NA      NA       NA
VacancyST                   NA         NA      NA       NA
VacancyHI                   NA         NA      NA       NA
VacancyOC                   NA         NA      NA       NA
VacancyVI                   NA         NA      NA       NA

Why am I not getting any values here?

lucy On

This can happen if the data is not preprocessed correctly first. It seems that your 'Banks' column is empty (NA), and you should think about what to do with it (I am not sure whether this is the whole file or whether there are other non-empty values in that column). In general, before using your data, you need to replace the missing (NA) values in your columns with numerical values (usually the mean or median of the column). In R, for your column 'Banks' (in case it has other non-empty values), you can do it like this:

dataset$Banks = ifelse(is.na(dataset$Banks),
                 ave(dataset$Banks, FUN = function(x) mean(x, na.rm = TRUE)),
                 dataset$Banks)
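
To see what the line above does, here is a minimal sketch on made-up data. The data frame `dataset` and its `Banks` values are hypothetical and are not taken from the question. The sketch first counts missing values per column, which shows whether imputation is needed at all, and then fills each NA with the mean of the observed values.

```r
# Hypothetical toy data frame; `Banks` has two missing entries
dataset <- data.frame(Banks = c(20, NA, 23, NA, 21))

# Count missing values per column to check whether imputation is needed
colSums(is.na(dataset))

# Replace each NA with the mean of the observed (non-missing) values
col_mean <- mean(dataset$Banks, na.rm = TRUE)
dataset$Banks[is.na(dataset$Banks)] <- col_mean

dataset$Banks
```

Indexing with `is.na()` has the same effect as the `ifelse()`/`ave()` form when no grouping variable is used; which one to prefer is mostly a matter of taste.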

Alternatively, depending on your data set, if some of your values are represented by a period (or any other non-numeric value), you can import your CSV as

dataset = read.csv("data.csv", header = TRUE, na.strings = c(" ", ".", "NA"))

so that periods and empty values are read as NA, and then use the line above to replace the NA values with the mean, median, or something else.
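
Separately from missing values, the summary in the question itself reports "no residual degrees of freedom" and "9 not defined because of singularities". With only 5 rows, `lm` can estimate at most 5 coefficients (here the intercept plus four predictors), so the remaining coefficients come back as NA even when no cell in the data is missing. The following is a minimal sketch of that effect on random data; the data frame `toy` and its column names are hypothetical.

```r
set.seed(1)

# Hypothetical toy data: 5 rows, one response `y` and six predictors x1..x6
toy <- as.data.frame(matrix(rnorm(5 * 7), nrow = 5))
names(toy) <- c("y", paste0("x", 1:6))

# More coefficients than rows: only 5 can be estimated, the rest are NA
fit_all <- lm(y ~ ., data = toy)
coef(fit_all)
df.residual(fit_all)   # no residual degrees of freedom are left

# Fewer predictors than rows: every coefficient gets an estimate and a standard error
fit_small <- lm(y ~ x1 + x2, data = toy)
summary(fit_small)$coefficients
```

If this is what is happening, the usual options are to collect more observations or to fit a model with far fewer predictors than rows, for example `General ~ Year + VacancyGen`, rather than `General ~ .`.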