Generalized linear mixed effects model with one-inflated proportion data


I’ll preface this by saying I am NOT savvy with statistics, so I appreciate any help, especially in simple terms. I hope to understand, not just get the “correct” answer. I also admit I am most comfortable looking at p-values to assess model output, which I know is controversial.

I have ecological data: a categorical predictor (treatment2), a categorical random effect (replicate), and a response variable (mean.occ) that is a proportion between 0 and 1 with some exact 1s, i.e. one-inflated. For reference, the response is the proportion of occupancy counts for a species in a patch over time: each sample in the time series is either 1 (occupied) or 0 (absent), so mean.occ is the proportion of 1s over the entire series. There are no exact 0s, because the species shows up at least once in every replicate over the course of the time series, but some replicates have mean.occ = 1, meaning the species showed up in every sample for the entire data collection period.
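For concreteness, here is a minimal sketch of how a response like mean.occ could be computed from a raw occupancy series (the raw-data object occ.long and its columns are hypothetical; the dataset linked below already contains mean.occ):

library(dplyr)

# Hypothetical raw data occ.long: one row per sample; 'occupied' is 0/1
regB <- occ.long %>%
  group_by(replicate, treatment2) %>%
  summarise(mean.occ  = mean(occupied),  # proportion of samples occupied
            n.samples = n(),             # length of the time series
            .groups   = "drop")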

The data are linked here. I am providing the dataset because I truly think the way the data are skewed toward 1 (and how many 1s there are, etc.) is contributing to the problem.

Below is the histogram of the response variable, mean.occ, showing the skew of the data:

[Histogram of mean.occ]

Here is a simple box plot showing the relationship between treatment2 and mean.occ:

[Box plot of mean.occ by treatment2]

I have tried many approaches, none of which give output that makes sense for my data (i.e., p-values for certain treatment comparisons come out non-significant when the boxplot shows a clear difference, e.g. between hetero and homo.high).

My original model is a glmm with a binomial family and logit link, as follows:

library(glmmTMB)

main.model.1 <- glmmTMB(mean.occ ~ 1 + treatment2 + (1 | replicate),
                        family = binomial(link = "logit"), data = regB)
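One caveat with this specification: with a proportion response, the binomial family needs the number of Bernoulli trials supplied via weights, otherwise each proportion is treated as a single trial. A sketch, assuming a column like n.samples (as in the hypothetical sketch above) holding the number of samples behind each proportion:

# Proportion response + binomial family: supply trials via 'weights'.
# n.samples is hypothetical -- the number of samples behind each mean.occ.
main.model.1w <- glmmTMB(mean.occ ~ 1 + treatment2 + (1 | replicate),
                         family = binomial(link = "logit"),
                         weights = n.samples,
                         data = regB)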

Using summary(main.model.1) and pairs(emmeans(main.model.1, ~treatment2)) for post-hoc analysis, I get output that is very off compared to the boxplot (the p-values are clearly wrong); see this snippet:

contrast                    estimate     SE df t.ratio p.value
 hetero - homo.high             1.801  1.136 59   1.585  0.7569
 hetero - homo.low             -5.439 10.578 59  -0.514  0.9995
 hetero - homo.med             -1.206  1.622 59  -0.743  0.9952
 hetero - single.hetero         0.998  1.052 59   0.949  0.9796
 hetero - single.high           2.385  1.188 59   2.008  0.4855
 hetero - single.low           -3.642  4.403 59  -0.827  0.9908
 hetero - single.med            0.615  1.155 59   0.532  0.9994

Diagnosing with glmmTMB's diagnose(main.model.1) gives the message “Unusually large coefficients”, followed by: “Large negative coefficients in zi (log-odds of zero-inflation), dispersion, or random effects (log-standard deviations) suggest unnecessary components (converging to zero on the constrained scale); large negative and/or positive components in binomial or Poisson conditional parameters suggest (quasi-)complete separation. Large values of nbinom2 dispersion suggest that you should use a Poisson model instead.” (Note: I tried the poisson family even though the data do not have a Poisson distribution, and it did not give accurate results either.)
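For reference, the DHARMa residual check itself follows the standard simulateResiduals workflow (diagnose() above is a glmmTMB function, separate from DHARMa):

library(DHARMa)
sim.res <- simulateResiduals(main.model.1)  # simulation-based scaled residuals
plot(sim.res)            # QQ plot plus residual-vs-predicted panel
testDispersion(sim.res)  # formal test for over/underdispersion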

The first thing that got me closer was to reflect the response (1 - mean.occ), so the 1s become 0s, and fit a zero-inflated model with a Beta family:

regB$mean.occ.zi <- 1 - regB$mean.occ  # reflect the response so 1s become 0s
m.ziformula.beta <- glmmTMB(mean.occ.zi ~ 1 + treatment2 + (1 | replicate),
                            family = beta_family(),
                            ziformula = ~ 1,
                            data = regB)
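Since the response was reflected, the signs of the contrasts below are reversed relative to mean.occ; asking emmeans for the response scale makes that easier to keep straight (a sketch):

library(emmeans)
# Marginal means on the probability scale -- note these are for
# 1 - mean.occ, so the group ordering is reversed.
emm.zi <- emmeans(m.ziformula.beta, ~ treatment2, type = "response")
pairs(emm.zi)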

This got me a little closer… but there are still some funky estimates and p-values that don't make sense; see a snippet of the post-hoc output here:

contrast                    estimate    SE df t.ratio p.value
 hetero - homo.high            -0.905 0.332 57  -2.724  0.1365
 hetero - homo.low              2.776 0.983 57   2.824  0.1092
 hetero - homo.med              1.087 0.366 57   2.966  0.0784
 hetero - single.hetero        -0.588 0.266 57  -2.206  0.3637
 hetero - single.high          -1.649 0.311 57  -5.305  0.0001
 hetero - single.low            1.596 0.676 57   2.361  0.2803
 hetero - single.med           -0.531 0.283 57  -1.874  0.5739

Diagnosing with diagnose(m.ziformula.beta) now gives: “Unusually large Z-statistics suggest a *possible* failure of the Wald approximation - often also associated with parameters that are at or near the edge of their range (e.g. random-effects standard deviations approaching 0). (Alternately, they may simply represent very well-estimated parameters; intercepts of non-centered models may fall in this category.)” However, the DHARMa residuals etc. don't look bad:

[DHARMa residual diagnostic plots for m.ziformula.beta]
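Since diagnose() flags a possible failure of the Wald approximation, a likelihood-ratio test of the overall treatment effect is one way to cross-check the Wald-based p-values (a sketch):

# Likelihood-ratio test of the overall treatment effect as a
# cross-check on the Wald-based p-values
m.zi.null <- update(m.ziformula.beta, . ~ . - treatment2)
anova(m.zi.null, m.ziformula.beta)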

The next thing that got me closer was to squeeze the data away from the boundaries so there are no exact 1s (the Smithson & Verkuilen 2006 transformation) and use a Beta family:

# Squeeze toward the interior: y' = (y*(n - 1) + 0.5)/n (Smithson & Verkuilen 2006)
n <- length(regB$mean.occ)
regB$mean.occ.trans <- (regB$mean.occ * (n - 1) + 0.5) / n
m.beta.test.trans <- glmmTMB(mean.occ.trans ~ 1 + treatment2 + (1 | replicate),
                             family = beta_family(), data = regB)

The summary and post-hoc output are much closer to what the data look like, but some comparisons are still off; see a snippet here:

contrast                    estimate    SE df t.ratio p.value
 hetero - homo.high            0.8854 0.305 58   2.907  0.0898
 hetero - homo.low            -2.4312 0.398 58  -6.107  <.0001
 hetero - homo.med            -1.2352 0.321 58  -3.844  0.0069
 hetero - single.hetero        0.5674 0.243 58   2.335  0.2930
 hetero - single.high          1.6584 0.285 58   5.825  <.0001
 hetero - single.low          -2.3493 0.396 58  -5.928  <.0001
 hetero - single.med           0.5230 0.257 58   2.031  0.4706

Diagnosing with diagnose(m.beta.test.trans) again gives the same “Unusually large Z-statistics suggest a *possible* failure of the Wald approximation” message, although the residual tests again look pretty good.

Some other things I tried that still give somewhat funky results:

  1. Tried glmmPQL to account for overdispersion (see here: https://rdrr.io/cran/MASS/man/glmmPQL.html) with the non-transformed data and a binomial(link = "logit") family, and also with the transformed data and a beta family (see the glmmPQL sketch after this list). Unfortunately, family = quasibinomial does not work with glmmTMB; I get an error.

  2. Removing replicate altogether (see here: https://www.researchgate.net/post/Why-cant-i-use-the-family-quasi-in-generalised-linear-mixed-model-GLMM), since I read that a random effect with few levels (I have 8 replicates) can lead to inaccurate variance estimates. This seemed to work pretty well, but then why, for example, are homo.high and single.high not significantly different?

  3. Using the gamlss package for zero-one-inflated beta models (see here: https://cran.r-project.org/web/packages/gamlss/index.html), i.e. a generalized additive model instead of a GLMM (see the gamlss sketch after this list). I know GAMs are more flexible and less constrained, so I need to be more careful with them. It seems to work pretty well judging from the summary, but I have yet to find a post-hoc option for these models, so I'm unsure about my pairwise comparisons. I'm also unsure whether this is the right path and whether I'm being careful enough; re: https://stats.stackexchange.com/questions/380426/when-to-use-a-gam-vs-glm.
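For reference, a minimal glmmPQL sketch for item 1 (here with quasibinomial, which glmmPQL accepts even though glmmTMB does not; glmmPQL uses nlme-style random-effect syntax):

library(MASS)  # glmmPQL (fits via nlme)
m.pql <- glmmPQL(mean.occ ~ treatment2,
                 random = ~ 1 | replicate,
                 family = quasibinomial(link = "logit"),
                 data = regB)
summary(m.pql)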
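And a minimal gamlss sketch for item 3, assuming replicate is a factor (BEINF1 is gamlss's beta distribution inflated at 1, and random() is its simple random-intercept term):

library(gamlss)
m.gamlss <- gamlss(mean.occ ~ treatment2 + random(replicate),
                   family = BEINF1,  # beta inflated at one
                   data = regB)
summary(m.gamlss)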

In summary: I am trying to find the correct model for my data given its type (one-inflated proportions). So far nothing I have tried yields p-values that make sense (i.e., pairwise comparisons that look clearly different in the box plot do not come out as significant in the model output). I am comfortable assessing p-values but not much beyond that when it comes to model evaluation, which I feel is holding me back.

I appreciate any help!
