To compute propensity scores, I want to estimate cross-sectional binary response (logit) models on a year-by-year basis using the statsmodels Logit class. The explanatory variables are firm characteristics, and the dependent variable, Treatment, indicates whether or not a firm is in the sample. The estimation results are confusing and point to possibly complete quasi-separation.
How can I address the poor model assessment metrics and the possible quasi-separation?
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

df = pd.read_excel("Posthoc/PSM_firms_combined.xlsx")
# Restrict to a single cross-section (2015)
df_2015 = df[df['Year'] == 2015]
X_year = df_2015[['Total Assets', 'Growth', 'Price to Book Value per Share', 'Total Debt to Common Equity']]
y_year = df_2015['Treatment']
logit_model = sm.Logit(y_year, X_year)
results = logit_model.fit()
print(results.summary())
# Obtain the likelihood-ratio chi-squared statistic (fitted model vs. intercept-only model)
chi_squared = results.llr
print("Chi-Squared Statistic:", chi_squared)
# Calculate McFadden's R-squared
log_likelihood_model = results.llf # Log-likelihood of the model
log_likelihood_null = results.llnull # Log-likelihood of a null model
mcfadden_r2 = 1 - (log_likelihood_model / log_likelihood_null)
print("McFadden's R-squared:", mcfadden_r2)
# Obtain AUC-ROC
y_pred_prob = results.predict(X_year)
auc_roc = roc_auc_score(y_year, y_pred_prob)
print("AUC-ROC:", auc_roc)
I also tried logit_model.fit(method='bfgs') and FirthLogisticRegression, both without success.
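
One possible first diagnostic for the suspected separation is to compare, covariate by covariate, the range of values among treated and non-treated firms. The sketch below uses toy data and a hypothetical helper, overlap_report; the column names size and growth are made up for illustration. It is a univariate check only, so it cannot detect separation that arises from a linear combination of several covariates.

```python
import pandas as pd

def overlap_report(data, covariates, treatment="Treatment"):
    """For each covariate, report its range per group and whether the ranges overlap."""
    treated = data[data[treatment] == 1]
    control = data[data[treatment] == 0]
    rows = []
    for col in covariates:
        t_min, t_max = treated[col].min(), treated[col].max()
        c_min, c_max = control[col].min(), control[col].max()
        # Two ranges overlap when each group's minimum lies at or below the other's maximum
        overlaps = bool((t_min <= c_max) and (c_min <= t_max))
        rows.append({
            "covariate": col,
            "treated_min": t_min, "treated_max": t_max,
            "control_min": c_min, "control_max": c_max,
            "ranges_overlap": overlaps,
        })
    return pd.DataFrame(rows).set_index("covariate")

# Toy data (hypothetical): "size" splits the groups perfectly, "growth" does not
toy = pd.DataFrame({
    "Treatment": [0, 0, 0, 1, 1, 1],
    "size":      [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "growth":    [0.1, 0.5, 0.3, 0.2, 0.4, 0.6],
})
report = overlap_report(toy, ["size", "growth"])
print(report)
```

A covariate whose ranges do not overlap predicts the treatment perfectly on its own, which is the situation in which the maximum likelihood estimate of its coefficient does not exist.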