I’m starting with R, and I’m facing an issue but I ignore its origin, so I come here to ask for your help and I hope you will be able to help me I have a dataset containing the following columns: Y = response trt : 1st explanatory variable lot : 2nd explanatory variable Location : is the column to group my data, so “aov” runs by Location. My objective is to extract Tukey-test goup letters output + Studentized residuals to flag outliers. For that , I’m using the following code :
library(multcomp)
library(multcompView)
library(dplyr)
library(tidyr)
library(na.tools)
library(tibble)
#My datasets and my variables
output <- as.data.frame(dataset)
output$response <- as.numeric(output$response)
output$trt <- as.factor(output$trt)
output$lot <- as.factor(output$lot)
output <- output %>%
group_by(location) %>%
mutate(row=row_number(),unique_trt = n_distinct(trt), unique_lot = n_distinct(lot), var = var(response, na.rm = TRUE))
# create a subset dataset where I use filter to exclude cases where one of the factors has only 1 level and where variance in data is null
lsplit <- output %>%
filter( var != 0 & unique_trt != 1 & unique_lot != 1)
lsplit <- lsplit %>% mutate(response3 = replace_na(response, mean(response, na.rm = TRUE)))
# split my sub-dataset by "location"
lsplit <- split(lsplit, ~ location)
# Create a function of lm and TukeyHSD to apply by "Location" and combine the output to the original dataset "output"
all_results <- lapply(lsplit,function(x){
model <- aov(reponse3 ~ trt + lot, data = x, na.rm=TRUE)
tukey <- TukeyHSD(model)
cld <- multcompLetters4(model, tukey)
TK <- group_by(x,trt) %>% summarise (mean = mean(response3, na.rm=TRUE)) %>% arrange(desc(mean))
cld <- as.data.frame.list(cld$trt)
TK$cld <- cld$Letters
x$stud_resid <- studres(model)
x <- left_join(x, TK, by = "trt")
x
})
output <- left_join(output,bind_rows(all_results) %>% select(location, mean, stud_resid, row, cld) , by=c("location", "row"))
This code works perfectly for a sample of locations that I’m using for testing, but once I apply it to the whole dataset (I’m using it on a dataset of around 20 M rows in Power query), I get an error message and I don’t know from which location this comes, that’s why I can’t provide an example to reproduce the error. Here is the error message I get :
“ Error in vec2mat(x) : 36 NAs not allowed, found in x” Any help os most welcome Thank you in advance for your help !