Calculate Mahalanobis distances when there are missing values


In R, I am trying to calculate Mahalanobis distances to check whether there are outliers in my data set, in order to test one of the assumptions for a MANOVA. I have missing values in my data set. I originally tried the mahalanobis function, but that didn't seem to work with missing values, so I tried the MDmiss function in the modi package. This worked for the cases where values were missing in both of two variables (DO and Chla). However, if data were missing in only Chla or only DO, the distances were not calculated. Neither MDmiss nor mahalanobis returned distances for the rows that had no missing values.

I also tried using is.na and na.omit with the original mahalanobis function, but that didn't work either. I have included a sample data set. I appreciate the help. Thanks.

envdata <- data.frame(WaterTemp = c(56.7, 56.4, 60.8,60.6, 59.3, 57.5, 57.9, 65.8,59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA))


#Check for outliers using the Mahalanobis distance
#https://www.statology.org/mahalanobis-distance-r/

#Mahalanobis only works on numeric data. Make new data frame with only numeric variables
library(dplyr) #provides %>%, mutate() and select()
#Convert integers to numeric
envdata <- envdata %>% mutate(SPC = as.numeric(SPC), DO = as.numeric(DO))
envdata_numeric <- envdata %>% dplyr::select(WaterTemp, SPC, Salinity, DO, Chla)

#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- mahalanobis(envdata_numeric, colMeans(envdata_numeric, na.rm = TRUE), cov(envdata_numeric))

#create new column in data frame to hold p-value for each Mahalanobis distance
envdata_numeric$p <- pchisq(envdata_numeric$mahal, df = 4, lower.tail = FALSE)
#Df = (c-1)
#DF = 5-1

envdata_numeric

#***#error with calculating distances. Possibly because of NA values. Try this other package. https://search.r-project.org/CRAN/refmans/modi/html/MDmiss.html
devtools::install_github("martinSter/modi")
library(modi)

#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- MDmiss(envdata_numeric, colMeans(envdata_numeric), cov(envdata_numeric))

There are 2 answers

2
Alvaro Salas On

I use a new method for that: I evaluate the determinant of the matrix S. That determinant depends on the unknown values. I then minimize that positive determinant (a quadratic form) to obtain the predicted values.
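One possible reading of this idea, as a rough sketch only: treat the missing entries as free parameters, plug them into the data, and minimize the determinant of the sample covariance matrix with optim. The toy matrix X, the helper names fill and det_cov, and the choice of column means as starting values are all assumptions made for illustration; the answer does not spell out its exact procedure.

```r
# Hypothetical toy data: 5 rows, 3 variables, two missing entries
X <- matrix(c(1.0, 2.1,  2.9,
              2.0, 3.9,   NA,
              3.0, 6.2,  9.1,
              4.0,  NA, 12.2,
              5.0, 9.8, 15.0),
            ncol = 3, byrow = TRUE)

# Positions of the missing entries (column-major linear indices)
miss <- which(is.na(X))

# Helper (hypothetical): return X with the missing entries replaced by theta
fill <- function(theta) {
  Xf <- X
  Xf[miss] <- theta
  Xf
}

# Objective: determinant of the sample covariance matrix S of the filled data
det_cov <- function(theta) det(cov(fill(theta)))

# Start from the observed column means of the columns that hold the NAs
start <- colMeans(X, na.rm = TRUE)[col(X)[miss]]

# Minimize det(S) over the unknown values (Nelder-Mead by default)
fit <- optim(start, det_cov)

# Filled-in data using the minimizing values
X_filled <- fill(fit$par)
X_filled
```

With many unknowns and few rows, the minimizer can drive the determinant toward zero, so the filled-in values should be read with caution.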

0
Artem On

There is a problem with the data you have shown: columns DO and Chla are collinear. Namely, you have only two complete observations (see rows 3 and 8 of envdata_numeric below):

envdata_numeric <- structure(list(WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 
57.9, 65.8, 59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999, 
47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92, 
31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA, 
NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358, 
NA, NA, NA, 6.306, 26.84, NA, NA)), class = "data.frame", row.names = c(NA, 
-10L))

# WaterTemp   SPC Salinity  DO   Chla
# 1       56.7 46600    30.28  NA  7.045
# 2       56.4 47520    30.92  NA     NA
# 3       60.8 47821    31.54  96  8.358
# 4       60.6 47801    31.34  NA     NA
# 5       59.3 47999    31.24  NA     NA
# 6       57.5 47418    30.87  NA     NA
# 7       57.9 47646    31.03  NA  6.306
# 8       65.8 49156    32.17 101 26.840
# 9       59.2 46350    30.12  99     NA
# 10      59.0 46260    30.05 103     NA
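
The point about complete observations can be checked directly. The following is a small sketch using base R only; the variable names n_complete, S_all, cc and r_do_chla are introduced here for illustration.

```r
# Same data as above, rebuilt so the snippet runs on its own
envdata_numeric <- data.frame(
  WaterTemp = c(56.7, 56.4, 60.8, 60.6, 59.3, 57.5, 57.9, 65.8, 59.2, 59),
  SPC = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260),
  Salinity = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05),
  DO = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103),
  Chla = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA)
)

# Count and locate the rows with no NA in any column
n_complete <- sum(complete.cases(envdata_numeric))
which(complete.cases(envdata_numeric))

# By default cov() propagates NA, so the covariance matrix is unusable
S_all <- cov(envdata_numeric)
anyNA(S_all)

# Restricted to the complete rows, DO and Chla lie on a straight line,
# because any two points do
cc <- envdata_numeric[complete.cases(envdata_numeric), ]
r_do_chla <- cor(cc$DO, cc$Chla)
r_do_chla
```

This is also why dropping incomplete rows with na.omit does not help here: only two rows remain, which is too few to estimate a 5 x 5 covariance matrix that can be inverted.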

Roughly speaking, you are trying to find outliers or calculate distances, but you do not have enough information to "draw the ellipsoid" around your cloud of points, which is what the Mahalanobis distance does geometrically. I sketched the situation below: the white circles show the data in the columns without NA, and the big red ones indicate the two points located in the higher dimensions (rows 3 and 8). There are infinitely many ellipsoids that can be drawn through two points and the center (I drew two).

[Image: sketch of the point cloud with two possible ellipsoids drawn through the two complete points and the center]

Anyway, if I add a data point to the DO column (e.g. 100 in row 1) and then proceed with imputation (I used the mice package), I can formally calculate the distances. As you will see, the p-values are all > 0.1. This means that although the algorithm runs, there is not enough information to judge outliers even with 3 complete observations. There are too many NAs.

library(mice)
envdata_numeric[1, "DO"] <- 100
envdata_numeric_imp <- complete(mice(envdata_numeric))
envdata_numeric_imp$maha <- mahalanobis(envdata_numeric_imp, 
                                        colMeans(envdata_numeric_imp), 
                                        cov(envdata_numeric_imp))

envdata_numeric_imp$p = pchisq(envdata_numeric_imp$maha, df = 4, 
                               lower.tail = FALSE)


envdata_numeric_imp

Output:

   WaterTemp   SPC Salinity  DO   Chla     maha         p
1       56.7 46600    30.28 100  7.045 1.274517 0.8656841
2       56.4 47520    30.92 103  7.045 3.554027 0.4697112
3       60.8 47821    31.54  96  8.358 7.201919 0.1255948
4       60.6 47801    31.34 103  6.306 3.968202 0.4103263
5       59.3 47999    31.24  96  6.306 5.790871 0.2153200
6       57.5 47418    30.87 101 26.840 6.985705 0.1366456
7       57.9 47646    31.03 101  6.306 1.523915 0.8223970
8       65.8 49156    32.17 101 26.840 7.254101 0.1230542
9       59.2 46350    30.12  99  6.306 3.556350 0.4693622
10      59.0 46260    30.05 103  7.045 3.890395 0.4210425
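
As a quick check of the p-values above, the printed distances can be fed back into pchisq. This sketch copies the maha column from the output; the 0.001 cutoff is only a commonly used convention assumed here for illustration, not something stated in the answer.

```r
# Mahalanobis distances copied from the output above
maha <- c(1.274517, 3.554027, 7.201919, 3.968202, 5.790871,
          6.985705, 1.523915, 7.254101, 3.556350, 3.890395)

# Upper-tail chi-square p-values with the same df as in the answer
p <- pchisq(maha, df = 4, lower.tail = FALSE)
round(p, 7)

# Smallest p-value: every row stays above 0.1
min(p)

# Flag rows whose distance exceeds the 0.001 cutoff (assumed convention)
cutoff <- qchisq(0.999, df = 4)
flagged <- which(maha > cutoff)
flagged
```

No row is flagged at this cutoff, in line with the observation that the imputed data give no basis for calling any point an outlier.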