I am trying to carry out logistic regression on a dataset. I have converted Age (a categorical variable) to binary (0 = "Adult", 1 = "Immature"). Tail length is a continuous numerical variable, and I want to predict the probability that an animal with a tail length greater than 220 mm is immature.
There is a large difference in sample size between the two age classes, as shown here:
table(rt$Age)
#    Adult Immature
#      120      448
Some code:
library(dplyr) # recode() with this syntax is assumed to come from dplyr
# Create a separate column holding Age as binary (0 = adult, 1 = immature)
rt$bin_age <- recode(rt$Age, "A" = 0, "I" = 1)
library(ggplot2)
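ggplot(rt, aes(x = Tail,
               y = bin_age)) +
  # Jittered raw observations (0 = adult, 1 = immature)
  geom_jitter(color = "blue",
              size = 3,
              height = 0.04,
              width = 0.2,
              alpha = 0.5) +
  # LOESS smoother through the 0/1 values (not a logistic fit)
  geom_smooth(method = "loess", size = 1,
              col = "red", lty = 2, se = FALSE) +
  labs(x = "Tail Length (in mm)", y = "Age (0 = Adult, 1 = Immature)") +
  theme_classic()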
When I plot the data using ggplot, I get the following image:
Rather than producing a clean sigmoidal curve, it produces more of a "sideways S" curve.
I identified three outliers with tail lengths below 175 mm, so I removed them:
# Use 175mm as the cut-off = remove values <175mm
which(rt$Tail < 175) # Rows 261, 317 and 361
# Remove rows 261, 317 and 361
rt <- rt[-c(261, 317, 361), ]
and got this image:
Has this occurred because of the difference in sample size between the two age classes? Is there a way to equalise the sample sizes (e.g. through looped sub-sampling or something similar) so that I can interpret this more appropriately?
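To make the sub-sampling idea concrete, here is a minimal sketch of what repeated balanced sub-sampling could look like. It uses simulated stand-in data: `rt_sim`, its tail-length means and standard deviations, and the number of repetitions `n_rep` are all assumptions for illustration, and only the group sizes (120 and 448) come from the table above. It is not a claim that this is the appropriate remedy.

```r
set.seed(42)

# Simulated stand-in for rt (hypothetical values, NOT the real data);
# only the group sizes 120 / 448 are taken from the table above.
rt_sim <- data.frame(
  Age  = factor(rep(c("A", "I"), times = c(120, 448))),
  Tail = c(rnorm(120, mean = 215, sd = 12),
           rnorm(448, mean = 225, sd = 12))
)
rt_sim$bin_age <- ifelse(rt_sim$Age == "I", 1, 0)

n_min <- min(table(rt_sim$Age))  # size of the smaller group
n_rep <- 200                     # number of sub-samples (arbitrary choice)

# Row indices belonging to each age class
rows_by_age <- split(seq_len(nrow(rt_sim)), rt_sim$Age)

# For each repetition: draw n_min rows per class, refit, keep the Tail slope
slopes <- replicate(n_rep, {
  idx <- unlist(lapply(rows_by_age, sample, size = n_min))
  fit <- glm(bin_age ~ Tail, family = binomial(link = "logit"),
             data = rt_sim[idx, ])
  coef(fit)[["Tail"]]
})

summary(slopes)  # spread of the slope across the balanced sub-samples
```

Comparing the distribution of `slopes` with the slope from the full, unbalanced fit would show whether the class imbalance changes the estimated relationship.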
I also produced a visreg() plot with the outliers left in, and I'm not sure whether it is more appropriate to use:
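library(visreg)
# Logistic regression of age class (0/1) on tail length
age_glm <- glm(bin_age ~ Tail,
               family = binomial(link = "logit"),
               data = rt)
summary(age_glm)
# Fitted probability curve on the response (0 to 1) scale
visreg(age_glm, xvar = "Tail", scale = "response", rug = FALSE)
# Overlay the jittered raw observations
points(jitter(bin_age, 0.2) ~ Tail,
       data = rt,
       pch = 20, col = "black", cex = 1, lwd = 1)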
Giving me this graph:
It still looks a bit funky...
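For the prediction itself, a minimal sketch of reading a fitted probability off a binomial GLM is given below. It again uses simulated stand-in data (`rt_sim` and its parameters are hypothetical, not the real measurements). Note that the model returns the probability at a specific tail length (here exactly 220 mm), not for the whole range above it.

```r
set.seed(1)

# Simulated stand-in for rt (hypothetical values, NOT the real data)
rt_sim <- data.frame(
  bin_age = rep(c(0, 1), times = c(120, 448)),
  Tail    = c(rnorm(120, mean = 215, sd = 12),
              rnorm(448, mean = 225, sd = 12))
)

fit <- glm(bin_age ~ Tail,
           family = binomial(link = "logit"),
           data = rt_sim)

# Predicted probability of being immature at a tail length of 220 mm
p_220 <- predict(fit, newdata = data.frame(Tail = 220), type = "response")
p_220
```

With the real data, `fit` would be replaced by `age_glm`; `type = "response"` returns the value on the probability scale rather than the log-odds scale.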
Note that I have used a Kruskal-Wallis test to test for differences in tail length between the ages, and p < 0.001, so I was expecting a more marked difference in the graphs.
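For reference, a Kruskal-Wallis test of this kind can be run along the following lines. This is a sketch on simulated stand-in data (`rt_sim` and its values are hypothetical), so the resulting p-value says nothing about the real dataset.

```r
set.seed(7)

# Simulated stand-in for rt (hypothetical values, NOT the real data)
rt_sim <- data.frame(
  Age  = factor(rep(c("A", "I"), times = c(120, 448))),
  Tail = c(rnorm(120, mean = 215, sd = 12),
           rnorm(448, mean = 225, sd = 12))
)

# Rank-based test for a difference in tail length between the age classes
kw <- kruskal.test(Tail ~ Age, data = rt_sim)
kw$p.value
```

A small p-value here only indicates that the tail-length distributions differ between the groups; the two distributions can still overlap heavily, which is what the shape of the fitted probability curve reflects.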