For some context, I have detected some cell populations and their associated counts in my cytometry data samples using FAUST.
I have some meta information that groups my samples into treatment groups (just Treatment "Yes" or "No").
Now I want to fit a mixed model with lme4::glmer
on my counts data, and perform contrasts for each cell population, to see if they are significantly over-represented in the "Yes" treatment group over the "No" treatment group (probably using emmeans
).
I'm not very familiar with linear models and model contrasts. I followed the last steps in this FAUST vignette (scroll all the way to the bottom).
Up to the lme4::glmer
step it all looks fine, but I'm not sure how to perform the contrasts here. multcomp::glht
as in the vignette does not give me the results that I want (ie: p-values and fold change for each cell population contrast across the treatment levels).
I know there are a lot of questions and references, but the more I read the more confused I get... Any help is appreciated, I feel like I'm almost there.
I have made a MWE starting from what my counts data looks like.
count_mat <- matrix(c(234,1287,234,1174,34,2285,7,7,
1157,779,45,572,293,3090,83,76,
124,1097,3,829,18,235,366,322,
341,1526,3,364,97,593,192,215,
378,330,18,960,20,293,204,197,
113,174,20,204,5,67,538,469,
14,552,2,464,3,25,911,978,
317,302,24,763,89,2791,71,55,
47,541,4,778,16,361,314,261,
322,192,14,1099,30,434,216,217,
120,231,18,829,24,812,97,83,
6,229,1,555,2,78,952,904,
88,628,0,1599,19,159,609,508,
1122,640,29,374,0,143,0,0,
1374,77,1884,688,0,289,5,2,
308,74,475,856,84,477,263,215,
59,0,2,508,1,0,0,1,
45,106,185,106,0,29,10,3,
7,142,30,61,0,12,54,28,
3,88,6,41,1,13,148,84,
4,51,2,44,0,3,115,80,
0,31,0,214,0,1,1281,1082,
0,24,0,225,0,1,1330,1057,
10,43,0,1412,1,19,806,624,
639,137,1108,1604,0,205,9,6,
96,767,8,587,1,59,28,11,
290,343,709,3666,10,668,511,335,
70,383,0,384,0,7,11,3,
160,149,297,3790,3,85,682,560,
49035,80096,23418,215461,3822,52535,62579,53099),
nrow=8, ncol=30)
colnames(count_mat) <- c(paste0('Pop_', 1:29), 'Pop_0')
rownames(count_mat) <- paste0('Samp_', 1:8)
cell_pops <- setdiff(colnames(count_mat), "Pop_0")
count_mat <- cbind(count_mat, parent_count=rowSums(count_mat))
count_df <- data.frame(ID=rownames(count_mat), as.data.frame(count_mat))
rownames(count_df) <- NULL
#
meta_info <- data.frame(ID=paste0('Samp_', 1:8), Treatment=c('Yes','Yes','Yes','No','No','Yes','No','No'), Patient=paste0('Pat_', 1:8))
count_df <- dplyr::inner_join(count_df, meta_info, by="ID")
This is what counts_df
looks like:
> count_df
ID Pop_1 Pop_2 Pop_3 Pop_4 Pop_5 Pop_6 Pop_7 Pop_8 Pop_9 Pop_10 Pop_11 Pop_12 Pop_13 Pop_14 Pop_15 Pop_16 Pop_17
1 Samp_1 234 1157 124 341 378 113 14 317 47 322 120 6 88 1122 1374 308 59
2 Samp_2 1287 779 1097 1526 330 174 552 302 541 192 231 229 628 640 77 74 0
3 Samp_3 234 45 3 3 18 20 2 24 4 14 18 1 0 29 1884 475 2
4 Samp_4 1174 572 829 364 960 204 464 763 778 1099 829 555 1599 374 688 856 508
5 Samp_5 34 293 18 97 20 5 3 89 16 30 24 2 19 0 0 84 1
6 Samp_6 2285 3090 235 593 293 67 25 2791 361 434 812 78 159 143 289 477 0
7 Samp_7 7 83 366 192 204 538 911 71 314 216 97 952 609 0 5 263 0
8 Samp_8 7 76 322 215 197 469 978 55 261 217 83 904 508 0 2 215 1
Pop_18 Pop_19 Pop_20 Pop_21 Pop_22 Pop_23 Pop_24 Pop_25 Pop_26 Pop_27 Pop_28 Pop_29 Pop_0 parent_count Treatment
1 45 7 3 4 0 0 10 639 96 290 70 160 49035 56483 Yes
2 106 142 88 51 31 24 43 137 767 343 383 149 80096 91019 Yes
3 185 30 6 2 0 0 0 1108 8 709 0 297 23418 28539 Yes
4 106 61 41 44 214 225 1412 1604 587 3666 384 3790 215461 240211 No
5 0 0 1 0 0 0 1 0 1 10 0 3 3822 4573 No
6 29 12 13 3 1 1 19 205 59 668 7 85 52535 65769 Yes
7 10 54 148 115 1281 1330 806 9 28 511 11 682 62579 72392 No
8 3 28 84 80 1082 1057 624 6 11 335 3 560 53099 61482 No
Patient
1 Pat_1
2 Pat_2
3 Pat_3
4 Pat_4
5 Pat_5
6 Pat_6
7 Pat_7
8 Pat_8
Then I turn the data into long format with only some selected cell populations:
sel_cell_pops <- cell_pops[c(2,3,4,5,6,7,8,9,10,11,12,13,16,17,19,20,21,22,23,24,26,27,28,29)]
glmer_df <- count_df[,c("ID", sel_cell_pops, "parent_count", "Treatment", "Patient")]
glmer_df <- tidyr::gather(glmer_df, key=pop_name, value=count, -c("ID", "parent_count", "Treatment", "Patient"))
glmer_df$pop_name <- as.factor(glmer_df$pop_name)
glmer_df$obs_factor <- as.factor(paste0("Obs_", seq(nrow(glmer_df))))
glmer_df$Patient <- factor(glmer_df$Patient, levels=gtools::mixedsort(unique(glmer_df$Patient)))
glmer_df$Treatment <- factor(glmer_df$Treatment, levels=gtools::mixedsort(unique(glmer_df$Treatment)))
#set reference level (not needed)
glmer_df$Treatment <- stats::relevel(glmer_df$Treatment, ref=levels(glmer_df$Treatment)[1])
#my reference level here for Treatment is "No"
#comparisons would be Treatment Yes vs No for each population
#
#do these on the fly calculation here or emmeans complains
glmer_df$parent_count2 <- glmer_df$parent_count - glmer_df$count
This is what glmer_df
(the data that goes into my linear model) looks like
> head(glmer_df)
ID parent_count Treatment Patient pop_name count obs_factor parent_count2
1 Samp_1 56483 Yes Pat_1 Pop_2 1157 Obs_1 55326
2 Samp_2 91019 Yes Pat_2 Pop_2 779 Obs_2 90240
3 Samp_3 28539 Yes Pat_3 Pop_2 45 Obs_3 28494
4 Samp_4 240211 No Pat_4 Pop_2 572 Obs_4 239639
5 Samp_5 4573 No Pat_5 Pop_2 293 Obs_5 4280
6 Samp_6 65769 Yes Pat_6 Pop_2 3090 Obs_6 62679
Now I fit the model. Here I have my first question: I specify (1|Patient)
as random effect in my model, but in the FAUST vignette I follow, they do (1|obs_factor)
; what would be more appropriate with my data?
#fit the model
glmer_fit <- lme4::glmer(
formula=cbind(count, parent_count2) ~ pop_name*Treatment + (1|Patient), #should this be (1|obs_factor)?
data=glmer_df,
family="binomial",
nAGQ=0,
control=lme4::glmerControl(
optimizer="bobyqa",
optCtrl = list(maxfun = 1e9),
boundary.tol = 1e-9,
tolPwrss=1e-9,
check.conv.singular = lme4::.makeCC(action = "warning", tol = 1e-12)
)
)
This is where I have trouble, I use emmeans
as I saw in other questions, but I don't make sense of the result
em_result <- emmeans::emmeans(glmer_fit, ~ pop_name*Treatment)
contrast_result <- emmeans::contrast(em_result, interaction = "pairwise")
This is what I get:
> head(contrast_result)
pop_name_pairwise Treatment_pairwise estimate SE df z.ratio p.value
Pop_10 - Pop_11 No - Yes 0.621 0.0592 Inf 10.484 <.0001
Pop_10 - Pop_12 No - Yes -1.560 0.0728 Inf -21.440 <.0001
Pop_10 - Pop_13 No - Yes -0.659 0.0566 Inf -11.638 <.0001
Pop_10 - Pop_16 No - Yes 0.426 0.0561 Inf 7.587 <.0001
Pop_10 - Pop_17 No - Yes -1.640 0.1416 Inf -11.583 <.0001
Pop_10 - Pop_19 No - Yes 0.775 0.1180 Inf 6.564 <.0001
Results are given on the log odds ratio (not the response) scale.
But since what I want is to test each population individually to see if it's greater
in Treatment "Yes" vs "No", I would expect something more like:
pop_name Treatment_pairwise estimate SE df z.ratio p.value
Pop_1 Yes - No xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Pop_2 Yes - No xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Pop_3 Yes - No xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Pop_4 Yes - No xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Pop_5 Yes - No xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Pop_6 Yes - No xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
How should I accomplish this? A line per cell population with it's contrast for Treatment "Yes" (greater
) vs "No", with p-values and fold change if possible.
Thanks!
UPDATE:
I think I might have figured the right contrast
commands, as the outputs seem closer to what I want... are these correct for my purposes?
#1
em_result1 <- emmeans::emmeans(glmer_fit, ~ Treatment | pop_name)
contrast_result1 <- emmeans::contrast(em_result1, interaction = "pairwise", adjust = "bh")
#2
em_result2 <- emmeans::emmeans(glmer_fit, ~ pop_name*Treatment)
contrast_result2 <- emmeans::contrast(em_result2, "consec", simple = "Treatment", combine = TRUE, adjust = "bh")
The results seem identical, just with Treatment levels "flipped" (why is that?). I'm leaning more towards #2
as "Yes vs No" seems "more correct" given that "No" is my reference level...
> head(contrast_result1)
Treatment_pairwise pop_name estimate SE df z.ratio p.value
No - Yes Pop_10 0.357 0.221 Inf 1.616 0.1592
No - Yes Pop_11 -0.264 0.221 Inf -1.195 0.2785
No - Yes Pop_12 1.917 0.225 Inf 8.513 <.0001
No - Yes Pop_13 1.015 0.220 Inf 4.606 <.0001
No - Yes Pop_16 -0.069 0.220 Inf -0.313 0.7543
No - Yes Pop_17 1.997 0.256 Inf 7.806 <.0001
Results are given on the log odds ratio (not the response) scale.
P value adjustment: BH method for 6 tests
> head(contrast_result2)
contrast pop_name estimate SE df z.ratio p.value
Yes - No Pop_10 -0.357 0.221 Inf -1.616 0.1592
Yes - No Pop_11 0.264 0.221 Inf 1.195 0.2785
Yes - No Pop_12 -1.917 0.225 Inf -8.513 <.0001
Yes - No Pop_13 -1.015 0.220 Inf -4.606 <.0001
Yes - No Pop_16 0.069 0.220 Inf 0.313 0.7543
Yes - No Pop_17 -1.997 0.256 Inf -7.806 <.0001
Results are given on the log odds ratio (not the response) scale.
P value adjustment: BH method for 6 tests
So would this be the right way to do it for my purposes?
NOTE:
I read somewhere that, because this is a mixed model, I should run confint
on the emmeans
result... how should I integrate this?
Thanks!