How to compute TukeyHSD letters with multcompLetters4 and associate them to the originl dataset by group

52 views Asked by At

I’m starting with R, and I’m facing an issue but I ignore its origin, so I come here to ask for your help and I hope you will be able to help me I have a dataset containing the following columns: Y = response trt : 1st explanatory variable lot : 2nd explanatory variable Location : is the column to group my data, so “aov” runs by Location. My objective is to extract Tukey-test goup letters output + Studentized residuals to flag outliers. For that , I’m using the following code :

library(multcomp)
library(multcompView)
library(dplyr)
library(tidyr)
library(na.tools)
library(tibble)

#My datasets and my variables
output <- as.data.frame(dataset)
output$response <- as.numeric(output$response)
output$trt <- as.factor(output$trt)
output$lot <- as.factor(output$lot)
output <- output %>% 
     group_by(location) %>% 
     mutate(row=row_number(),unique_trt = n_distinct(trt), unique_lot = n_distinct(lot), var = var(response, na.rm = TRUE))


# create a subset dataset where I use filter to exclude cases where one of the factors has only 1 level and where variance in data is null 
lsplit <-  output %>%
         filter( var != 0 & unique_trt != 1 & unique_lot != 1)
lsplit <- lsplit %>% mutate(response3 = replace_na(response, mean(response, na.rm = TRUE)))

# split my sub-dataset by "location"
lsplit <- split(lsplit, ~ location)
# Create a function of lm and TukeyHSD to apply by "Location" and combine the output to the original dataset "output"
all_results <- lapply(lsplit,function(x){
     model <- aov(reponse3 ~ trt + lot, data = x, na.rm=TRUE)
     tukey <- TukeyHSD(model)
     cld <- multcompLetters4(model, tukey)
     TK <- group_by(x,trt) %>% summarise (mean = mean(response3, na.rm=TRUE)) %>% arrange(desc(mean))
     cld <- as.data.frame.list(cld$trt)
     TK$cld <- cld$Letters
     x$stud_resid <- studres(model)
     x <- left_join(x, TK, by = "trt")
      x
  })
output <- left_join(output,bind_rows(all_results) %>% select(location, mean, stud_resid, row, cld) , by=c("location", "row"))

This code works perfectly for a sample of locations that I’m using for testing, but once I apply it to the whole dataset (I’m using it on a dataset of around 20 M rows in Power query), I get an error message and I don’t know from which location this comes, that’s why I can’t provide an example to reproduce the error. Here is the error message I get :

“ Error in vec2mat(x) : 36 NAs not allowed, found in x” Any help os most welcome Thank you in advance for your help !

0

There are 0 answers