Best subset of matrix with minimum number of co-occuring non NA values?

129 views Asked by At

I have a N by T matrix of data (N >>T), which have many missing NA values. I am searching for the best row-subset of my matrix, such that any two rows N_i and N_j will have at least k>=2 co-occurring (i.e. occurring at same time t) non-NA values. To motivate this, my end goal of finding this subset is to be able to compute a pairwise correlation matrix (using R's cor(., use="pairwise.complete.obs") as initial estimator for a shrinkage procedure.

Formally, if X is my raw matrix, let Y be the matrix where each entry has value 1 if corresponding X entry has an observed value, 0 for NA. Then YY' will be a N x N matrix counting co-occurrences of non-NA values. Initially this YY' matrix will contain 0 values, indicating that row N_i and N_j have 0 co-occurring values. I want to find the largest subset of X such that the minimum of YY' is say k=2 , or 3, etc.

Is there any algorithm to solve this question, with ideally an implementation in R? I have a feeling this YY' matrix can be considered like a graph, and that there might exist known procedures for this question (clique finding? Steiner tree?) but I am not knowledgeable enough in this domain... I can think of a greedy iterative algorithm removing the worst row each time and re-computing YY, but was hoping there would be a more elegant and faster solution?

Quick simulation of the data:

set.seed(123)
X <- matrix(rnorm(200*20), 200, 20)
X[sample(1:(200*20), 200*20/2)] <- NA
Y <- 1*!is.na(X)
YY <- tcrossprod(Y)
YY[1:3,1:3]
#>      [,1] [,2] [,3]
#> [1,]    9    6    3
#> [2,]    6   10    4
#> [3,]    3    4   11
min(YY)
#> [1] 0

Created on 2020-09-28 by the reprex package (v0.3.0)

1

There are 1 answers

5
David Eisenstat On

Unfortunately this problem is NP-hard. Given a graph with n vertices and m edges, we can reduce the problem of finding a max clique to solving this problem with k=1 on the vertex-edge incidence matrix (duplicate columns for higher k).

I'd turn the reduction around in the obvious way and reduce this problem to max clique.

The vertices of the graph correspond to rows. There is an edge between two vertices iff the corresponding rows have enough non-NA coöccurring values. To find the max clique, you can call out to the igraph R package.