I have a data frame containing client and contract characteristics, plus a 0/1 indicator showing whether a fall happened during the period between 2008 and 2017. I am using a binomial model to regress the probability of a fall on these characteristics. I have 38,000 different contracts.
So I am using a binomial model like this (R code):
formule <- Chute_commerciale ~ Niveau_gar_incapacite + Niv_indem_mens + Regrpt_franchise + Niveau_prime + Situation_familiale + Classe_age_chute + Grde_Region + Regrpt_strate + Taille_courtier + Commission + Retention + Anciennete + Regrpt_CSP + Regrpt_sinistres + Couplage
logit <- glm(Chute_commerciale~1, data=train, family=binomial(link="logit"))
selection_asc_AIC <- step(logit, direction="forward", trace=TRUE, k=2, scope=list(upper=formule))
After testing for multicollinearity, I eliminated some variables and grouped some terms. I get this result:
[Image: results from GLM]
[Image: results from GLM 2]
These results do not look correct, judging by the null deviance and the residual deviance.
I suspect that my exposure variable is the problem. My contracts begin and end in different years, so the exposure can be, for example, 5.32 or 1.36 years, and the data are subject to truncation and censoring.
How can I handle this exposure variable in a binomial logistic regression? If I duplicate each row by its number of years of exposure, the observations are no longer independent.
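To make the exposure question concrete, here is a minimal sketch of one commonly used option: a binomial GLM with a complementary log-log link and log(exposure) as an offset. It assumes a constant yearly hazard over each contract's observed window, which may not hold for real contracts, and it does not address left truncation. The data are simulated, and the column `exposure`, the single covariate, and the parameter values -2.5 and 0.4 are hypothetical stand-ins, not taken from my data.

```r
# Simulated data only: the exposure column and parameter values are hypothetical.
set.seed(42)
n <- 5000
sim <- data.frame(
  Commission = rnorm(n),                      # stand-in covariate
  exposure   = runif(n, min = 0.25, max = 9)  # years the contract was observed
)

# Assumed constant yearly hazard that depends on the covariate.
hazard <- exp(-2.5 + 0.4 * sim$Commission)

# P(fall during the observed window) = 1 - exp(-hazard * exposure)
sim$Chute_commerciale <- rbinom(n, size = 1,
                                prob = 1 - exp(-hazard * sim$exposure))

# cloglog link with an exposure offset:
# log(-log(1 - p)) = log(exposure) + b0 + b1 * Commission
fit <- glm(
  Chute_commerciale ~ Commission + offset(log(exposure)),
  family = binomial(link = "cloglog"),
  data   = sim
)

summary(fit)
coef(fit)  # estimates should land near the simulated values -2.5 and 0.4
```

Each contract stays a single row, so no duplication is needed. Under the constant-hazard assumption, the exponentiated coefficients can be read as hazard ratios. If that assumption is doubtful, a survival model on the contract durations may be a better fit.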