Succinct way of handling missing observations in numpy.cov?

As the title suggests, I'm wondering whether there's a succinct way for handling missing data when calculating covariance matrices in Python/pandas. Consider the dataframe

df = pd.DataFrame({'var1': [1, 2, 3, np.nan, 5, np.nan], 'var2': [1, 1.5, 2, 2.5, np.nan, 3.5]})

If we were to simply do np.cov(df.var1.dropna(), df.var2.dropna()), we'd get an error, since the two columns contain different numbers of missing values and the resulting arrays have different lengths. Two ways of getting around this that I found were:

rowind = list(set(df.var1.dropna().index).intersection(set(df.var2.dropna().index)))

and

rowind = (~np.isnan(df.var1)) & (~np.isnan(df.var2))

and then computing np.cov(df.loc[rowind, "var1"], df.loc[rowind, "var2"]). I'm wondering, however, whether there is some built-in function somewhere that can do this in a less verbose way.
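For reference, here is a self-contained version of the masking workaround (a minimal sketch, assuming only standard numpy/pandas imports):

import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': [1, 2, 3, np.nan, 5, np.nan],
                   'var2': [1, 1.5, 2, 2.5, np.nan, 3.5]})

# keep only the rows where both variables are observed
rowind = (~np.isnan(df.var1)) & (~np.isnan(df.var2))
print(np.cov(df.loc[rowind, 'var1'], df.loc[rowind, 'var2']))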

There are 2 answers

EdChum (best answer)

Call dropna and then cov:

In [110]:
df.dropna().cov()

Out[110]:
      var1  var2
var1   1.0  0.50
var2   0.5  0.25
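Note that dropna() performs listwise deletion here: only rows 0, 1 and 2 survive, since those are the rows where both columns are observed:

df.dropna()
#    var1  var2
# 0   1.0   1.0
# 1   2.0   1.5
# 2   3.0   2.0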

This matches np.cov:

In [111]:
rowind = (~np.isnan(df.var1)) & (~np.isnan(df.var2))
np.cov(df.loc[rowind, "var1"], df.loc[rowind, "var2"])

Out[111]:
array([[ 1.  ,  0.5 ],
       [ 0.5 ,  0.25]])

This differs from calling df.cov() directly, which gives different results; the docs state that it excludes missing data, but it's not obvious exactly how they are handled:

In [107]:
df.cov()

Out[107]:
          var1   var2
var1  2.916667  0.500
var2  0.500000  0.925
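A quick check by hand suggests pairwise deletion on the diagonal: each variance is computed from that column's own non-missing values:

# var1 has 4 observed values [1, 2, 3, 5] with mean 2.75
((1-2.75)**2 + (2-2.75)**2 + (3-2.75)**2 + (5-2.75)**2) / 3  # 2.916667
# var2 has 5 observed values [1, 1.5, 2, 2.5, 3.5] with mean 2.1
((1-2.1)**2 + (1.5-2.1)**2 + (2-2.1)**2 + (2.5-2.1)**2 + (3.5-2.1)**2) / 4  # 0.925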

OK, I just figured out what the above is doing:

In [115]:
df.fillna(df.mean(axis=1)).cov()

Out[115]:
          var1   var2
var1  2.916667  0.500
var2  0.500000  0.925
mmngreco

I made the gist below to reproduce and illustrate that difference; I hope it is useful in this discussion:

# =============================================================================
# NUMPY VS PANDAS: DIFFERENT ESTIMATION OF COVARIANCE IN PRESENCE OF NAN VALUES
# =============================================================================
import numpy as np
import pandas as pd

# data with nan values
M = np.random.randn(10, 2)
# add missing values
M[0, 0] = np.nan
M[1, 1] = np.nan

# Covariance matrix calculations
# ==============================
# numpy
# -----
masked_arr = np.ma.array(M, mask=np.isnan(M))
cov_numpy = np.ma.cov(masked_arr, rowvar=0, allow_masked=True, ddof=1).data

# pandas
# ------
cov_pandas = pd.DataFrame(M).cov(min_periods=0).values
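Printing both matrices side by side makes the disagreement visible (the diagonals agree; the off-diagonal terms generally differ):

print('numpy:\n', cov_numpy)
print('pandas:\n', cov_pandas)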

The snippet below shows, by hand, what each library is actually computing for the off-diagonal element:

# Homemade covariance coefficient calculation
# (what each of them is actually doing)
# =============================================
# select elements to estimate the element 0,1 in the covariance matrix
x = M[:,0]
y = M[:,1]

mask_x = ~np.isnan(x)
mask_y = ~np.isnan(y)
mask_common = mask_x & mask_y

# numpy
# -----
xn = x-np.mean(x[mask_x])
yn = y-np.mean(y[mask_y])
cov_np = sum(a*b for a,b,c in zip(xn,yn, mask_common) if c)/(np.sum(mask_common)-1)

# pandas
# ------
xn = x-np.mean(x[mask_common])
yn = y-np.mean(y[mask_common])
cov_pd = sum(a*b for a,b,c in zip(xn,yn, mask_common) if c)/(np.sum(mask_common)-1)
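As a sanity check (a sketch, assuming the arrays from the gist above are still in scope), each homemade value should reproduce the corresponding library's off-diagonal entry:

# compare homemade values with library results; both comparisons
# are expected to print True
print(np.isclose(cov_np, cov_numpy[0, 1]))
print(np.isclose(cov_pd, cov_pandas[0, 1]))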

Note that the main difference lies in how the mean of each variable is computed: numpy centers each variable by the mean of its own non-missing values, whereas pandas centers both variables by their means over the rows where both are observed.
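As a worked example, applying the same two formulas to the df from the question (rows 0, 1 and 2 are the only rows where both variables are observed):

# numpy-style: center by each column's own observed-value means (2.75 and 2.1)
((1-2.75)*(1-2.1) + (2-2.75)*(1.5-2.1) + (3-2.75)*(2-2.1)) / 2  # 1.175
# pandas-style: center by the means over the common rows only (2.0 and 1.5)
((1-2)*(1-1.5) + (2-2)*(1.5-1.5) + (3-2)*(2-1.5)) / 2  # 0.5, the off-diagonal of df.cov() above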

source: https://gist.github.com/mmngreco/bd86213d9ccd8ddc61683a853ce2fced


Edit: I opened an issue in pandas:

https://github.com/pandas-dev/pandas/issues/16837