Applying goodness-of-fit tests for a logistic regression with binary data?


I have a dataset with a binary response variable. I fitted a logistic regression to the data by maximum likelihood, and now I want to check the goodness of fit. With the logistic regression model I basically want to predict the projectile energy associated with a specific probability P of observing a failed impact test (result).

My data looks like this:

energy = [2.6106, 2.614, 2.6175, 2.6416, 2.6715, 2.6773, 2.6785, 2.6889, 2.6958, 2.7121, 2.7121, 2.7238, 2.7308, 2.7378, 2.7448, 2.7671, 2.7671, 2.7671, 2.7671, 2.7718, 2.7765, 2.7848, 2.7895, 2.7978, 2.8144, 2.8179, 2.8179, 2.8191, 2.8203, 2.8286, 2.8286, 2.8334, 2.8441, 2.8477, 2.8489, 2.8537, 2.8561, 2.8596, 2.8608, 2.8752, 2.8824, 2.8824, 2.8872, 2.892, 2.898, 2.9041, 2.9065, 2.9065, 2.9149, 2.9185, 2.9282, 2.9294, 2.933, 2.9367, 2.9391, 2.9391, 2.9427, 2.9427, 2.9476, 2.9476, 2.9537, 2.9549, 2.9597, 2.961, 2.9658, 2.9695, 2.9719, 2.9744, 2.9756, 2.9805, 2.9817, 2.9817, 2.9866, 2.9902, 2.9951, 2.9964, 2.9964, 3.0074, 3.0197, 3.0283, 3.0332, 3.0369, 3.048, 3.0628, 3.069, 3.0864, 3.0889, 3.1088, 3.1138, 3.1263, 3.1375, 3.1425, 3.1501, 3.1651, 3.2068, 3.2245, 3.2411, 3.2717, 3.2845]

result = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
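The fit and the inversion described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact procedure used for the data above: it runs on synthetic stand-in data, the coefficients `true_b0` and `true_b1` are hypothetical, and `neg_log_likelihood` and `energy_at_probability` are illustrative helper names.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

# Synthetic stand-in data (NOT the measurements listed above):
# the probability of a failed test rises with energy.
rng = np.random.default_rng(0)
energy = np.sort(rng.uniform(2.6, 3.3, size=100))
true_b0, true_b1 = -29.5, 10.0  # hypothetical coefficients for the simulation
result = rng.binomial(1, expit(true_b0 + true_b1 * energy))


def neg_log_likelihood(beta, x, y):
    """Negative Bernoulli log-likelihood of a logistic regression."""
    eta = beta[0] + beta[1] * x
    # log(1 + exp(eta)) evaluated stably via logaddexp
    return np.sum(np.logaddexp(0.0, eta) - y * eta)


# Centre the predictor so the optimiser works on a well-scaled problem.
x_mean = energy.mean()
fit = minimize(neg_log_likelihood, x0=np.zeros(2),
               args=(energy - x_mean, result), method="BFGS")
b0_centred, b1 = fit.x
b0 = b0_centred - b1 * x_mean  # intercept on the original energy scale


def energy_at_probability(p):
    """Invert P = expit(b0 + b1 * energy) for the energy."""
    return (logit(p) - b0) / b1


print("slope:", b1, "intercept:", b0)
print("energy at P = 0.5:", energy_at_probability(0.5))
```

The inversion only makes sense when the fitted slope `b1` is clearly different from zero; otherwise the predicted energy is extremely sensitive to small changes in the coefficients.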

For a better understanding, my plot:

[plot image not included]

I tried using the Shapiro-Wilk test and the likelihood-ratio G-test to test the goodness of fit, but I am unsure whether I used them correctly for my specific data. The coefficient of determination R^2 suggests that the logistic regression is a good fit for my data, but the GOF tests mentioned above say otherwise. Do you see any flaws?


import numpy as np
from scipy import stats
from scipy.stats import logistic

# Residuals against the standard logistic CDF (loc=0, scale=1)
residuals = np.asarray(result) - logistic.cdf(np.asarray(energy))
statistic, p_value = stats.shapiro(residuals)

alpha = 0.05

if p_value < alpha:
    print("The residuals are not normally distributed, which could suggest that the logistic distribution may not fit the data well.")
else:
    print("The residuals are normally distributed, which could suggest that the logistic distribution fits the data well.")


import numpy as np
from scipy.stats import chi2, logistic

# Estimating the parameters of the logistic distribution
params =

# Calculate the likelihood of the null model (without regression)
null_model_params = (0, 1)  # Null model parameters for the logistic distribution
null_model_log_likelihood = np.sum(logistic.logpdf(energy, *null_model_params))

# Calculate the likelihood of the alternative model (with regression)
alternative_model_log_likelihood = np.sum(logistic.logpdf(energy, *params))

# Calculate the likelihood ratio statistic (G-statistic)
lr_test_statistic = -2 * (null_model_log_likelihood - alternative_model_log_likelihood)

# Degrees of freedom for the test
df = len(params)

# Critical value at a specific significance level alpha (here, 0.05) and df.
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, df)

print(f"Likelihood Ratio Test Statistic (G-statistic): {lr_test_statistic}")
print(f"Critical value at {alpha} significance level and {df} degrees of freedom: {critical_value}")

# Interpretation of the test
if lr_test_statistic > critical_value:
    print("The logistic distribution does not fit the data well.")
else:
    print("The logistic distribution fits the data well.")
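For comparison, here is a hedged sketch of a likelihood-ratio (G) test built on the Bernoulli likelihood of the binary outcome. The snippet above instead evaluates `logistic.logpdf` on the energy values, which gives the likelihood of the energies under a logistic distribution rather than the likelihood of the 0/1 results given energy. The data below are synthetic, and `log_likelihood` and `max_log_likelihood` are illustrative helper names. Note that this test compares the model with energy against an intercept-only model, so it checks whether energy improves the fit rather than whether the fit is adequate in an absolute sense.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import chi2

# Synthetic stand-in data (NOT the measurements from the question).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(2.6, 3.3, size=100))
y = rng.binomial(1, expit(10.0 * (x - 2.95)))


def log_likelihood(beta, design, y):
    """Bernoulli log-likelihood of a logistic regression."""
    eta = design @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))


def max_log_likelihood(design, y):
    """Maximised log-likelihood for a given design matrix."""
    res = minimize(lambda b: -log_likelihood(b, design, y),
                   x0=np.zeros(design.shape[1]), method="BFGS")
    return -res.fun


ones = np.ones_like(x)
ll_null = max_log_likelihood(ones[:, None], y)                        # intercept only
ll_full = max_log_likelihood(np.column_stack([ones, x - x.mean()]), y)  # intercept + energy

g_statistic = 2.0 * (ll_full - ll_null)
df = 1  # the full model has one parameter more than the null model
p_value = chi2.sf(g_statistic, df)

print(f"G = {g_statistic:.3f}, df = {df}, p = {p_value:.4g}")
```

A small p-value here indicates that including energy improves on the intercept-only model; it does not by itself show that the logistic form is the right one.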

Would a Pearson chi-square test be the better option?
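With ungrouped 0/1 outcomes, a Pearson chi-square statistic computed observation by observation does not follow the usual chi-square reference distribution, so a common workaround is to group observations by fitted probability, as in the Hosmer-Lemeshow test. The following is a minimal sketch of that idea on synthetic data; the function name `hosmer_lemeshow`, the choice of 10 groups, and the simulated coefficients are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import chi2

# Synthetic stand-in data (NOT the measurements from the question).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(2.6, 3.3, size=200))
y = rng.binomial(1, expit(10.0 * (x - 2.95)))

# Fit the logistic regression by maximum likelihood on the centred predictor.
xc = x - x.mean()


def nll(beta):
    eta = beta[0] + beta[1] * xc
    return np.sum(np.logaddexp(0.0, eta) - y * eta)


beta_hat = minimize(nll, x0=np.zeros(2), method="BFGS").x
p_hat = expit(beta_hat[0] + beta_hat[1] * xc)


def hosmer_lemeshow(y, p_hat, n_groups=10):
    """Pearson-type chi-square on groups of sorted fitted probabilities."""
    order = np.argsort(p_hat)
    stat = 0.0
    for idx in np.array_split(order, n_groups):
        n_g = len(idx)
        observed = y[idx].sum()      # observed number of failures in the group
        expected = p_hat[idx].sum()  # expected number under the fitted model
        p_bar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1.0 - p_bar))
    df = n_groups - 2
    return stat, df, chi2.sf(stat, df)


hl_stat, hl_df, hl_p = hosmer_lemeshow(y, p_hat)
print(f"Hosmer-Lemeshow statistic = {hl_stat:.3f}, df = {hl_df}, p = {hl_p:.4g}")
```

Here a small p-value would point to lack of fit. The result can depend on the number of groups, so it is worth treating it as one diagnostic among several.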

