I am an ornithologist currently conducting a study on habitat use and selection in three animal species in agricultural habitats.
I have been using the survival package in R for my analyses, particularly the clogit function.
To provide more context, my dataset consists of over 2400 records, and each habitat use site is paired with a random control site. The paired observations (use site + a randomly control site) share the same code under the variable "group." For each habitat use site, I have collected information on three variables: the observed species (categorical factor with three levels, "species"), the vegetation height (continuous variable, "vegetation.height" in cm), and the type of vegetation structure (categorical factor with three levels, "vegetation.structure"). You can download it from: https://docs.google.com/spreadsheets/d/1B7xalNuI5IQLvEcAa-OaFESjwgJrP-uA/edit?usp=sharing&ouid=100938199232902549570&rtpof=true&sd=true
I am encountering a challenge with the clogit function when considering certain fixed terms in the model.
When I construct the model with "vegetation.height + vegetation.height:species" as fixed terms, everything works smoothly. However, when I try a similar approach with "vegetation.structure + vegetation.structure:species," (both categorical factor with three levels) I encounter NA coefficients for the "type3" level of "vegetation.structure" in the interaction term.
library(readxl)
library(survival)
library(earth)
# import dataset
data <- read_excel("dataset.xlsx", sheet = "sheet1")
# run a clogit regression model to study influence of vegetation.height on use, by species
mod1 <- clogit(use ~ vegetation.height +
vegetation.height:species +
strata(group),
data=data)
summary(mod1)
# run a second clogit regression model
# study influence of vegetation.structure on use, by species
mod2 <- clogit(use ~ vegetation.structure +
vegetation.structure:species +
strata(group),
data=data)
summary(mod2)
# NA coefficients produced by clogit!
Here's the summary output of mod2:
> summary(mod2)
Call:
coxph(formula = Surv(rep(1, 2418L), use) ~ vegetation.structure +
vegetation.structure:species + strata(group), data = data,
method = "exact")
n= 2418, number of events= 1209
coef exp(coef) se(coef) z Pr(>|z|)
vegetation.structuretype2 -2.400984 0.090629 0.507924 -4.727 2.28e-06 ***
vegetation.structuretype3 -4.040649 0.017586 0.459174 -8.800 < 2e-16 ***
vegetation.structuretype1:speciesspecies2 -2.054953 0.128099 0.518891 -3.960 7.49e-05 ***
vegetation.structuretype2:speciesspecies2 -0.004336 0.995673 0.511864 -0.008 0.99324
vegetation.structuretype3:speciesspecies2 NA NA 0.000000 NA NA
vegetation.structuretype1:speciesspecies3 -1.322921 0.266356 0.605789 -2.184 0.02898 *
vegetation.structuretype2:speciesspecies3 -1.015964 0.362053 0.378729 -2.683 0.00731 **
vegetation.structuretype3:speciesspecies3 NA NA 0.000000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
vegetation.structuretype2 0.09063 11.034 0.03349 0.24525
vegetation.structuretype3 0.01759 56.863 0.00715 0.04325
vegetation.structuretype1:speciesspecies2 0.12810 7.806 0.04633 0.35418
vegetation.structuretype2:speciesspecies2 0.99567 1.004 0.36510 2.71529
vegetation.structuretype3:speciesspecies2 NA NA NA NA
vegetation.structuretype1:speciesspecies3 0.26636 3.754 0.08125 0.87320
vegetation.structuretype2:speciesspecies3 0.36205 2.762 0.17234 0.76058
vegetation.structuretype3:speciesspecies3 NA NA NA NA
Concordance= 0.715 (se = 0.012 )
Likelihood ratio test= 531.8 on 6 df, p=<2e-16
Wald test = 221.7 on 6 df, p=<2e-16
Score (logrank) test = 433.3 on 6 df, p=<2e-16
I am puzzled by this issue and suspect that the regression might be considering the "type3" level of "vegetation.structure" as the reference, estimating coefficients as differences from this level. However, my goal would be to obtain estimates for all coefficients to understand the species comparisons.
I manipulated the dataset by introducing a significantly imbalanced number of observations for habitat use (1) in the "type3" level of "vegetation.structure". This manipulation did not resolve the issue, leading me to believe that the NA coefficients are not a result of an equal number of observations between habitat use and random conditions.
Moreover, during dataset manipulation, I observed instances where the model considers either "type1" or "type2" of "vegetation.structure" as the reference level, resulting in NA coefficients. This inconsistency adds to my confusion, as I am striving to obtain reliable estimates for all levels of "vegetation.structure" to facilitate species comparisons.
I greatly appreciate your expertise and was wondering if you could provide any insights or guidance on how to address this matter.