I am trying to construct a predictive model for a dataset with a dependent variable (Y) which consists of non-negative integer values with a maximum of 100. Below is a histogram plot of Y which shows its very much skewed to the right. Furthermore, I have around 100 possible explanatory variables (X) and a varying number of observations per 'individual', better said an unbalanced panel dataset with a total of around N = 50.000 observations. Also, for some individuals only one observation may exist. Unfortunately I can't share the original data but I've included plots where possible.
I hope to get some feedback on the steps I've taken thusfar.
I checked whether a log-transform of the dependent variable was somewhat normal distributed but this turned out not to be the case as shown by the log-transformed histogram below.
Due to the large number of zeros in my data and the fact that my dependent variable is non-negative and integer, I thought about first constructing both a zero-inflated Poisson model and a pooled Poisson model (using the pglm R-package). As the Poisson model assumes the variance of the dependent variable equals the mean, I checked this assumption by simply computing the sample mean and sample variance which turned out to be around 14 and 349 respectively, which appear to me to be sufficiently far apart that they I can conclude that these are not equal. Also, I read here that I can test it using an F-test. From here on I thought about constructing both a zero-inflated and pooling model using the negative-binomial distribution as it can handle overdispersion.
Then I checked to what extend my data fits a negative binomial distribution using a QQ-plot. Using the following code
fit_distr_nb <- fitdistr(x = Y, densfun = 'Negative Binomial')
plot(qnbinom(ppoints(Y), size=0.767006784, mu=14.289099815),
sort(Y),
xlab = 'Theoretical Values',
ylab = 'Sample values') +
abline(0,1)
Which resulted in the following plot
Which doesn't look too bad except for the the top-right part of the plot I guess.
After this I want to test for random effects using the following packages lme4 and glmmADMB, where the second package can be used for zero-inflated models with random and fixed effects. From this post I read that it can be checked by running the following code for the lme4 package, where I used the size parameter, found from fitting a negative binomial distribution to the data, as my theta.
theta <- 0.767
pool.test <- glm(Y ~ X, data = dataset, family = 'negbin'(theta))
re.test <- glmer(Y ~ X + (1|ID), data = dataset, family = 'negbin'(theta), REML = F)
And checking whether an IC, such as the AIC, is lower for the model including random effects. Would this be a correct approach?
Furthermore, could I extent this result to the zero-inflated model?
Furthermore, I was wondering if there could be any issues concerning the distribution of the number of observations per individual in an unbalanced panel data set. As can bee seen from the histogram below, I have a lot of individuals with only one or two observations.