As the title suggests, I'm wondering whether there's a succinct way to handle missing data when calculating covariance matrices in Python/pandas. Consider the dataframe:
df = pd.DataFrame({'var1': [1,2,3, np.nan, 5, np.nan], 'var2': [1, 1.5, 2, 2.5, np.nan, 3.5]})
If we were to simply do np.cov(df.var1.dropna(), df.var2.dropna()), we'd get an error, because the two columns have a different number of missing values and so the resulting series have different lengths.
Two ways of getting around this I found were:
rowind = list(set(df.var1.dropna().index).intersection(set(df.var2.dropna().index)))
and
rowind = (~np.isnan(df.var1)) & (~np.isnan(df.var2))
and then computing np.cov(df.loc[rowind, "var1"], df.loc[rowind, "var2"]). I am, however, wondering whether there's a built-in function somewhere that could do this in a less verbose way.
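Call dropna and then cov, i.e. df.dropna().cov(). This matches np.cov computed on the rows where both columns are present. It is different from calling df.cov directly on the original frame, which gives different results; the docs state that it excludes missing data, but it's unclear what it does with them.

OK, just figured out what the above is doing: df.cov excludes missing values pairwise, so each entry of the matrix is computed from the rows where the two columns involved are both non-null. The diagonal entries therefore use every non-null value of that column, not just the rows that are complete across the whole frame.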
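To make the difference concrete, here is a minimal sketch using the dataframe from the question. The variable names (complete, listwise, manual, pairwise) are mine and purely illustrative; the only library calls used are DataFrame.dropna, DataFrame.cov, Series.var and np.cov.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "var1": [1, 2, 3, np.nan, 5, np.nan],
    "var2": [1, 1.5, 2, 2.5, np.nan, 3.5],
})

# Listwise deletion: keep only the rows where every column is present,
# then compute the covariance matrix on what is left.
complete = df.dropna()
listwise = complete.cov()

# Feeding the same complete rows to np.cov gives the same matrix.
manual = np.cov(complete["var1"], complete["var2"])

# Pairwise deletion: df.cov() on the original frame uses, for each pair
# of columns, the rows where both of them are non-null. The diagonal is
# therefore each column's own variance over all of its non-null values.
pairwise = df.cov()
var1_alone = df["var1"].var()
var2_alone = df["var2"].var()

print(listwise)
print(manual)
print(pairwise)
print(var1_alone, var2_alone)
```

With only two columns the off-diagonal entry is the same in both matrices, since the rows where both columns are present are the same either way; only the variances on the diagonal differ. If you want a single consistent sample for every entry, df.dropna().cov() is the short form of the rowind approach in the question.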