Are outliers affecting the shape of my logistic regression curve or is it the fact the samples are unbalanced?

203 views Asked by emmz At 02 February 2021 at 16:58

I am trying to carry out logistic regression on a dataset. I have converted Age (categorical variable) to binary (0 = "Adult", 1 = "Immature"). Tail length is a continuous numerical variable and I want to predict the probability that an animal with a tail length of greater than 220mm is immature.

There is a large difference in sample size between the two age classes, as shown here:

table(rt$Age)

#    Adult Immature
#      120      448

Some code:

library(dplyr) # provides recode()

# Create a separate binary column from Age: "A" (adult) = 0, "I" (immature) = 1
rt$bin_age <- recode(rt$Age, "A" = 0, "I" = 1)

library(ggplot2)
ggplot(rt, aes(x = Tail,
                 y = bin_age)) +
  geom_jitter(color = "blue", 
              size = 3, 
              height = 0.04,
              width = 0.2,
              alpha = 0.5) +
  geom_smooth(method = "loess", size = 1, 
              col = "red", lty = 2, se = FALSE) +
  labs(x = "Tail Length (in mm)", y = "Age (0 = Adult, 1 = Immature)") +
  theme_classic()

When I plot the data using ggplot, I get the following image:

plot of tail length (x) vs sex (y)

Rather than producing a nice outright sigmoidal curve, it produces more of a "sideways S" curve.
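One thing that may be worth checking, offered as a sketch rather than a diagnosis: the plotting code above uses `method = "loess"`, which draws a local smoother and is not constrained to follow a logistic (sigmoidal) shape. The snippet below overlays a loess line and a binomial GLM line on the same axes. The data frame `sim`, its group means, and its standard deviations are made-up stand-ins for `rt`; only the group sizes (120 and 448) come from the table above.

```r
library(ggplot2)

# Hypothetical stand-in for rt: simulated tail lengths, NOT the real data
set.seed(1)
sim <- data.frame(
  Tail    = c(rnorm(120, mean = 230, sd = 12), rnorm(448, mean = 215, sd = 12)),
  bin_age = rep(c(0, 1), times = c(120, 448))  # 0 = Adult, 1 = Immature
)

p <- ggplot(sim, aes(x = Tail, y = bin_age)) +
  geom_jitter(height = 0.04, width = 0.2, alpha = 0.5, colour = "blue") +
  # Local smoother, as in the original plot (dashed red)
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE,
              colour = "red", linetype = 2) +
  # Logistic regression fit (solid black)
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = binomial(link = "logit")),
              se = FALSE, colour = "black") +
  labs(x = "Tail length (mm)", y = "P(Immature)") +
  theme_classic()

print(p)
```

If the solid line looks sigmoidal while the dashed one wiggles, the "sideways S" may come from the choice of smoother rather than from the data.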

I identified three outliers less than 175mm, so I removed them:

# Use 175 mm as the cut-off: remove values < 175 mm
which(rt$Tail < 175) # Rows 261, 317 and 361

# Remove rows 261, 317 and 361
rt <- rt[-c(261, 317, 361), ]

and got this image:

plot of tail length (x) vs sex (y)

Has this occurred because of the difference in sample size between the two populations? Is there a way to equalise the sample sizes (e.g. through looped sub-sampling or something similar) so I can interpret this more appropriately?
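A minimal sketch of what looped sub-sampling could look like, assuming the aim is to compare coefficients from the full data against coefficients from repeatedly balanced subsets. Everything here is illustrative: `sim` is simulated, the 200 repetitions are arbitrary, and the code assumes adults are the smaller group, as in the table above.

```r
# Hypothetical stand-in for rt: simulated tail lengths, NOT the real data
set.seed(42)
sim <- data.frame(
  Tail    = c(rnorm(120, mean = 230, sd = 12), rnorm(448, mean = 215, sd = 12)),
  bin_age = rep(c(0, 1), times = c(120, 448))  # 0 = Adult, 1 = Immature
)

n_minority <- min(table(sim$bin_age))  # size of the smaller class
adults     <- which(sim$bin_age == 0)  # minority class: keep every row
immatures  <- which(sim$bin_age == 1)  # majority class: sub-sample these

# Refit the model on 200 balanced subsets (all adults + an equal-sized
# random draw of immatures) and collect the coefficients
coefs <- t(replicate(200, {
  keep <- c(adults, sample(immatures, n_minority))
  coef(glm(bin_age ~ Tail, family = binomial(link = "logit"),
           data = sim[keep, ]))
}))

# Fit on the full, unbalanced data for comparison
full_fit <- glm(bin_age ~ Tail, family = binomial(link = "logit"), data = sim)

rbind(full = coef(full_fit), balanced_mean = colMeans(coefs))
```

In logistic regression, sampling the outcome classes at different rates mainly shifts the intercept, so if the `Tail` slope is similar in both rows, the imbalance is probably not what shapes the curve.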

I also produced a visreg() plot with the outliers left in, and I'm not sure whether it is the more appropriate one to use.

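library(visreg)

# Logistic regression of age class (0 = Adult, 1 = Immature) on tail length
age_glm <- glm(bin_age ~ Tail,
               family = binomial(link = "logit"),
               data = rt)

summary(age_glm)

# Fitted probability curve on the response scale
visreg(age_glm, xvar = "Tail", scale = "response", rug = FALSE)

# Overlay the observed 0/1 values, jittered vertically
# (ylim dropped: points() does not use it)
points(jitter(bin_age, 0.2) ~ Tail,
       data = rt,
       pch = 20, col = "black", cex = 1, lwd = 1)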

Giving me this graph:

It still looks a bit funky...
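For the stated goal of estimating the probability that an animal is immature at a given tail length, a model like `age_glm` can be queried with `predict()`. The sketch below does this on simulated data; `sim`, `fit`, and the three tail lengths are illustrative assumptions, not values from `rt`. Note that this gives the probability at a specific tail length (e.g. exactly 220 mm), not for "greater than 220 mm" as a whole.

```r
# Hypothetical stand-in for rt: simulated tail lengths, NOT the real data
set.seed(7)
sim <- data.frame(
  Tail    = c(rnorm(120, mean = 230, sd = 12), rnorm(448, mean = 215, sd = 12)),
  bin_age = rep(c(0, 1), times = c(120, 448))  # 0 = Adult, 1 = Immature
)

fit <- glm(bin_age ~ Tail, family = binomial(link = "logit"), data = sim)

# Tail lengths (mm) at which to evaluate the fitted curve
new_tails <- data.frame(Tail = c(200, 220, 240))

# type = "response" returns probabilities rather than log-odds
p_immature <- predict(fit, newdata = new_tails, type = "response")

cbind(new_tails, p_immature)
```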

Note that I used a Kruskal-Wallis test to test for differences in tail length between the ages, and P < 0.001, so I was expecting a more marked difference in the graphs.
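As a rough illustration of why a very small Kruskal-Wallis P value and a gentle-looking curve can coexist, the sketch below runs the test on simulated groups whose distributions overlap heavily. The group means and spread are invented for the example, and `sim` and `kw` are hypothetical names; only the group sizes match the table above.

```r
# Hypothetical data: two heavily overlapping groups, NOT the real rt
set.seed(3)
sim <- data.frame(
  Tail = c(rnorm(120, mean = 222, sd = 12), rnorm(448, mean = 215, sd = 12)),
  Age  = factor(rep(c("A", "I"), times = c(120, 448)))
)

# Kruskal-Wallis rank sum test of tail length by age class
kw <- kruskal.test(Tail ~ Age, data = sim)
kw$p.value

# Group medians and interquartile ranges, to show how much the groups overlap
tapply(sim$Tail, sim$Age, median)
tapply(sim$Tail, sim$Age, quantile, probs = c(0.25, 0.75))
```

With several hundred observations, a modest shift in location can be highly significant even though tail length alone separates the two classes only weakly.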
