This is a causal inference question, specifically about how to handle an unbalanced covariate. I fit an XGBoost model to estimate propensity scores for users (XGBoost had higher accuracy, precision, and AUC than logistic regression). When I compute the standardized mean differences (SMDs) of the covariates between the control and treatment groups, one feature (user age, which ranks high in feature gain/importance) is above the SMD threshold of 0.1. Things I have tried so far to remediate this:

  • Increased the sample size of control and treatment
  • Downsampled so that treated users are not matched to control users more than once
  • Resampled the training data so that the age-group distribution is proportionally the same for control and treatment

I'm stuck! How can I get the SMD for user age below 0.1? I'm unsure how to move forward with this confounder, since it is a highly important feature. Any help would be greatly appreciated.
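
For concreteness, here is a minimal sketch of the kind of balance check described above. It assumes the common definition of the SMD (difference in group means divided by the pooled standard deviation); the function name `standardized_mean_difference` and the toy age values are hypothetical and only illustrate the calculation, they are not the actual pipeline or data.

```python
import numpy as np


def standardized_mean_difference(treated, control):
    """SMD of one covariate: (mean_t - mean_c) / pooled standard deviation.

    Assumption: the pooled SD is sqrt((var_t + var_c) / 2) with sample
    variances (ddof=1). Other conventions exist, e.g. using only the
    treated group's SD in the denominator.
    """
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd


# Toy, made-up ages purely for illustration.
treated_age = [30, 40, 50]
control_age = [20, 30, 40]

smd_age = standardized_mean_difference(treated_age, control_age)
# A covariate is usually flagged as unbalanced when |SMD| > 0.1.
is_unbalanced = abs(smd_age) > 0.1
```

With these toy values the mean difference is 10 and the pooled standard deviation is 10, so the SMD is 1.0, well above the 0.1 threshold.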