I have a large dataset with around 200 columns and 1 million rows. I have a treatment group, and I'm trying to create a control group using propensity score matching based on about 15 different variables.
I have two questions for which I've found conflicting answers online, and I would appreciate it if you could help me out.
1) How should I organize the data to best run the matching process? My data has a mix of numeric, character, and factor (some ordered, others not) variables. I've seen some people online say that MatchIt runs the analysis with character variables, while others say that it does not work for the 'nearest' method but works with other ones. So, should I put some effort into converting everything into numeric or factor (which I'm not sure will be possible), or can I run MatchIt with my variables as they are?
2) Has MatchIt been updated to handle NAs in variables that are not used for matching? I've seen some old posts saying that MatchIt needed a COMPLETE dataset, even for the variables that were not being used for matching, but these posts also said that this would probably be fixed. Is it still the case?
Thanks
1) Beyond the data type, the question you should ask yourself is whether it makes sense to feed categorical data into a propensity score setting. Propensity scores are based on distances between observations, and calculating distances between categorical attributes is difficult. So even though, technically speaking, `MatchIt` does support other types, numeric features are the only really sensible input. You can either discard the categorical variables from your data or convert them to numeric (by creating dummy variables and numerically encoding ordinal features). Alternatively, you can keep the categorical features and impose exact matching on them using the `exact` parameter of the `matchit` function (note that in this case, you are not really using propensity score matching anymore).
2) This issue has not been solved in the current version, 3.0.2, which is annoying.
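
To make both points concrete, here is a minimal sketch on toy data. All column names (`id`, `treated`, `age`, `region`, `grade`, `notes`) are made up for illustration. For point 1 it encodes an ordered factor as integers and a character variable as dummy columns. For point 2 it shows one possible workaround, which is my assumption rather than anything built into the package: copy only the columns the matching needs, drop incomplete rows, match on that subset, then merge the result back onto the full data by `id`.

```r
# Toy data standing in for the real dataset (all column names are hypothetical).
set.seed(1)
n <- 200
dat <- data.frame(
  id      = seq_len(n),
  treated = rbinom(n, 1, 0.3),
  age     = round(rnorm(n, 50, 10)),
  region  = sample(c("north", "south", "west"), n, replace = TRUE),  # character
  grade   = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                   levels = c("low", "mid", "high"), ordered = TRUE),
  notes   = NA_character_,  # column not used for matching, entirely missing
  stringsAsFactors = FALSE
)
dat$age[c(3, 17)] <- NA  # two missing values in a matching covariate

# Point 1: numeric encoding of the categorical variables.
dat$grade_num <- as.integer(dat$grade)  # ordered factor -> 1, 2, 3
# Dummy columns for the unordered variable; the intercept column is dropped,
# so the first level ("north") acts as the reference category.
region_dummies <- model.matrix(~ region, data = dat)[, -1, drop = FALSE]

# Point 2: keep only the id, the treatment, and the matching covariates,
# then drop rows with missing values in that subset.
match_dat <- cbind(dat[, c("id", "treated", "age", "grade_num", "region")],
                   region_dummies)
match_dat <- match_dat[complete.cases(match_dat), ]

# Run the matching only if MatchIt is installed.
if (requireNamespace("MatchIt", quietly = TRUE)) {
  m <- MatchIt::matchit(treated ~ age + grade_num + regionsouth + regionwest,
                        data = match_dat, method = "nearest")
  # Alternative from the answer: pass exact = "region" to require exact
  # agreement on region instead of including its dummies in the score.
  matched <- MatchIt::match.data(m)
  # Re-attach all original columns (including the ones with NAs) by id.
  matched_full <- merge(matched[, c("id", "weights")], dat, by = "id")
}
```

Rows that are missing a matching covariate are excluded from the match in this sketch, so whether that is acceptable depends on how much missingness those 15 variables have.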