Applying goodness-of-fit tests to a logistic regression with binary data?


I have a dataset with a binary response variable. I fitted a logistic regression to the data by maximum likelihood, and now I want to check its goodness of fit. With the model I ultimately want to predict the projectile energy associated with a specific probability P of observing a failed impact test (the result variable); a minimal sketch of this fit follows the data below.

My data looks like this:

energy = [2.6106, 2.614, 2.6175, 2.6416, 2.6715, 2.6773, 2.6785, 2.6889, 2.6958, 2.7121, 2.7121, 2.7238, 2.7308, 2.7378, 2.7448, 2.7671, 2.7671, 2.7671, 2.7671, 2.7718, 2.7765, 2.7848, 2.7895, 2.7978, 2.8144, 2.8179, 2.8179, 2.8191, 2.8203, 2.8286, 2.8286, 2.8334, 2.8441, 2.8477, 2.8489, 2.8537, 2.8561, 2.8596, 2.8608, 2.8752, 2.8824, 2.8824, 2.8872, 2.892, 2.898, 2.9041, 2.9065, 2.9065, 2.9149, 2.9185, 2.9282, 2.9294, 2.933, 2.9367, 2.9391, 2.9391, 2.9427, 2.9427, 2.9476, 2.9476, 2.9537, 2.9549, 2.9597, 2.961, 2.9658, 2.9695, 2.9719, 2.9744, 2.9756, 2.9805, 2.9817, 2.9817, 2.9866, 2.9902, 2.9951, 2.9964, 2.9964, 3.0074, 3.0197, 3.0283, 3.0332, 3.0369, 3.048, 3.0628, 3.069, 3.0864, 3.0889, 3.1088, 3.1138, 3.1263, 3.1375, 3.1425, 3.1501, 3.1651, 3.2068, 3.2245, 3.2411, 3.2717, 3.2845]

result = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
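
For reference, this is roughly how I fit the model and invert it for the energy at a given failure probability P (a minimal sketch; statsmodels and the example value P = 0.5 are my choices, the specific library is incidental):

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.asarray(energy))
y = np.asarray(result)

# Fit the logistic regression result ~ energy by maximum likelihood
fit = sm.Logit(y, X).fit(disp=0)
b0, b1 = fit.params

# Invert the fitted model: the energy at which the failure probability equals P,
# from logit(P) = b0 + b1 * energy
P = 0.5
energy_at_P = (np.log(P / (1 - P)) - b0) / b1
print(f"Estimated energy for P = {P}: {energy_at_P:.4f}")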

For a better understanding, my plot:

[plot of the binary result (0/1) against energy]

I tried the Shapiro-Wilk test and a likelihood-ratio G-test to check the goodness of fit, but I am unsure whether I applied them correctly to this kind of data. The coefficient of determination R^2 suggests that the logistic regression fits my data well, but the two GOF tests say otherwise. Do you see any flaws?
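
In case the R^2 part matters: a common pseudo-R^2 for logistic regression is McFadden's, which compares the log-likelihood of the fitted model with that of an intercept-only model (a sketch using statsmodels; I am not sure this is the measure I should be relying on):

import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.asarray(energy))
y = np.asarray(result)

fit = sm.Logit(y, X).fit(disp=0)

# McFadden's pseudo-R^2: 1 - llf(fitted model) / llf(intercept-only model)
print(f"McFadden pseudo-R^2: {fit.prsquared:.3f}")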

Shapiro-Wilk test:

import numpy as np
from scipy import stats
from scipy.stats import logistic

energy = np.asarray(energy)
result = np.asarray(result)

# Residuals against the standard logistic CDF (loc=0, scale=1);
# note that no fitted regression parameters enter here
residuals = result - logistic.cdf(energy)
statistic, p_value = stats.shapiro(residuals)

alpha = 0.05

if p_value < alpha:
    print("The residuals deviate significantly from normality, which could suggest that the logistic model does not fit the data well.")
else:
    print("No significant departure from normality was detected, which could suggest that the logistic model fits the data well.")

G-test:

import numpy as np
from scipy.stats import logistic, chi2

# Fit a logistic *distribution* to the energy values alone
# (the binary result variable is not used here)
params = logistic.fit(energy)

# Log-likelihood under the null model (standard logistic, loc=0, scale=1)
null_model_params = (0, 1)
null_model_log_likelihood = np.sum(logistic.logpdf(energy, *null_model_params))

# Log-likelihood under the alternative model (fitted loc and scale)
alternative_model_log_likelihood = np.sum(logistic.logpdf(energy, *params))

# Likelihood-ratio statistic (G-statistic)
lr_test_statistic = -2 * (null_model_log_likelihood - alternative_model_log_likelihood)

# Degrees of freedom: number of parameters estimated in the alternative model
df = len(params)

# Critical value at significance level alpha
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, df)

print(f"Likelihood ratio test statistic (G-statistic): {lr_test_statistic}")
print(f"Critical value at {alpha} significance level and {df} degrees of freedom: {critical_value}")

# Interpretation of the test
if lr_test_statistic > critical_value:
    print("The logistic distribution does not fit the data well.")
else:
    print("The logistic distribution fits the data well.")

Would a Pearson chi-square test be the better option?
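
If so, my plan would be to group the observations by fitted probability (Hosmer-Lemeshow style) and compare observed and expected failure counts per group, roughly like this (the ten groups and the statsmodels fit are my own choices; the small group counts may be a problem):

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

X = sm.add_constant(np.asarray(energy))
y = np.asarray(result)
p = sm.Logit(y, X).fit(disp=0).predict(X)  # fitted failure probabilities

# Group observations into g bins by quantiles of the fitted probability
g = 10
edges = np.quantile(p, np.linspace(0, 1, g + 1))
idx = np.digitize(p, edges[1:-1])  # bin index 0 .. g-1 for each observation

obs = np.array([y[idx == k].sum() for k in range(g)])  # observed failures
exp = np.array([p[idx == k].sum() for k in range(g)])  # expected failures
n = np.array([(idx == k).sum() for k in range(g)])     # group sizes

# Hosmer-Lemeshow statistic; the conventional degrees of freedom are g - 2
hl = np.sum((obs - exp) ** 2 / (exp * (1 - exp / n)))
p_value = chi2.sf(hl, g - 2)
print(f"Hosmer-Lemeshow statistic: {hl:.3f}, p-value: {p_value:.4f}")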
