In R, I am trying to calculate Mahalanobis distances to check for outliers in my data set, as a test of one of the assumptions for a MANOVA. My data set has missing values. I first tried the mahalanobis function, but it did not seem to work with missing values, so I tried the MDmiss function in the modi package. This worked for the cases where values were missing in both of two variables (DO and Chla). However, if only Chla or only DO was missing, the distances were not calculated. Neither MDmiss nor mahalanobis returned distances for the cases with no missing values.
I also tried using is.na and na.omit with the original mahalanobis function, but that did not work either. I have included a sample data set. I would appreciate any help. Thanks.
envdata <- data.frame(WaterTemp = c(56.7, 56.4, 60.8,60.6, 59.3, 57.5, 57.9, 65.8,59.2, 59), SPC = c(46600, 47520, 47821, 47801, 47999, 47418, 47646, 49156, 46350, 46260), Salinity = c(30.28, 30.92, 31.54, 31.34, 31.24, 30.87, 31.03, 32.17, 30.12, 30.05), DO = c(NA, NA, 96, NA, NA, NA, NA, 101, 99, 103), Chla = c(7.045, NA, 8.358, NA, NA, NA, 6.306, 26.84, NA, NA))
#Check for outliers using the Mahalanobis distance
#https://www.statology.org/mahalanobis-distance-r/
#Mahalanobis only works on numeric data. Make new data frame with only numeric variables
#Convert integers to numeric
library(dplyr)
envdata <- envdata %>% mutate(SPC = as.numeric(SPC), DO = as.numeric(DO))
envdata_numeric <- envdata %>% dplyr::select(WaterTemp, SPC, Salinity, DO, Chla)
#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- mahalanobis(envdata_numeric, colMeans(envdata_numeric, na.rm = TRUE), cov(envdata_numeric))
#create new column in data frame to hold p-value for each Mahalanobis distance
envdata_numeric$p <- pchisq(envdata_numeric$mahal, df = 4, lower.tail = FALSE)
#Df = (c-1)
#DF = 5-1
envdata_numeric
#***#error with calculating distances. Possibly because of NA values. Try this other package. https://search.r-project.org/CRAN/refmans/modi/html/MDmiss.html
devtools::install_github("martinSter/modi")
library(modi)
#create new column in data frame to hold Mahalanobis distances
envdata_numeric$mahal <- MDmiss(envdata_numeric, colMeans(envdata_numeric), cov(envdata_numeric))
I use a new method for that: I evaluate the determinant of the matrix S. That determinant depends on the unknown values. I then minimize that positive determinant (a quadratic form) to get the predicted values.
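A minimal sketch of how this idea might look in R, not necessarily the exact procedure meant above. It assumes S is the sample covariance matrix of the completed data. The toy matrix X, the helper names fill and objective, and the choice of optim with BFGS are all illustrative assumptions, not part of the original post.

```r
# Hypothetical toy data (not the envdata from the question): 30 rows, 3 variables
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- -0.5 * x1 + 0.3 * x2 + rnorm(n, sd = 0.5)
X  <- cbind(x1, x2, x3)
X[c(2, 7), "x2"]  <- NA   # knock out a few cells
X[c(7, 15), "x3"] <- NA

# Linear indices of the missing cells; these are the unknowns
miss <- which(is.na(X))

# Hypothetical helper: plug candidate values into the missing cells
fill <- function(theta) {
  Xf <- X
  Xf[miss] <- theta
  Xf
}

# Determinant of the sample covariance matrix S as a function of the unknowns
objective <- function(theta) det(cov(fill(theta)))

# Start from the column means of the observed values
start <- colMeans(X, na.rm = TRUE)[col(X)[miss]]

# Minimize the determinant over the unknown values
fit <- optim(start, objective, method = "BFGS")

# Completed data and squared Mahalanobis distances computed from it
X_complete <- fill(fit$par)
d2 <- mahalanobis(X_complete, colMeans(X_complete), cov(X_complete))
```

Here d2 has one squared distance per row, including the rows that had missing cells. Two cautions: values filled in this way tend to pull the affected rows toward the bulk of the data, so their distances may look smaller than they should, and when many values are missing relative to the number of rows the determinant may be driven close to zero, which leaves S nearly singular.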