Fastest way of finding matching rows

1.7k views Asked by At

I am wondering what is the fastest way of finding all rows in xts object that are the same as one particular row

library(xts)

nRows <- 3

coreData <- data.frame(a=rnorm(nRows), b=rnorm(nRows), c=rnorm(nRows))

testXts1 <- xts(coreData, order.by=as.Date(1:nRows))
testXts2 <- xts(coreData, order.by=as.Date((nRows + 1):(2*nRows)))
testXts3 <- xts(coreData, order.by=as.Date((2*nRows + 1):(3*nRows)))

testXts <- rbind(testXts1, testXts2, testXts3)

> testXts
                    a         b         c
1970-01-02 -0.3288756  1.441799  1.321608
1970-01-03 -0.7105016  1.639239 -2.056861
1970-01-04  0.1138675 -1.782825 -1.081799
1970-01-05 -0.3288756  1.441799  1.321608
1970-01-06 -0.7105016  1.639239 -2.056861
1970-01-07  0.1138675 -1.782825 -1.081799
1970-01-08 -0.3288756  1.441799  1.321608
1970-01-09 -0.7105016  1.639239 -2.056861
1970-01-10  0.1138675 -1.782825 -1.081799

rowToSearch <- first(testXts)

> rowToSearch
                    a        b        c
1970-01-02 -0.3288756 1.441799 1.321608

indicesOfMatchingRows <- unlist(apply(testXts, 1, function(row)  lapply(1:NCOL(row), function(i) row[i] == coredata(rowToSearch[, i]))))

testXts[indicesOfMatchingRows, ]

                    a         b         c
1970-01-02 -0.3288756  1.441799  1.321608
1970-01-05 -0.3288756  1.441799  1.321608
1970-01-08 -0.3288756  1.441799  1.321608

I am sure this can be done in more elegant and fast way.

A more general question is how you say in R "I have this row matrix[5, ] how can I find (indexes of) other rows in matrix that are the same as matrix[5, ]".

How to do this in data.table?

4

There are 4 answers

3
josliber On BEST ANSWER

Since you said that speed is your main concern, you can get speedups even over a data.table solution with Rcpp:

library(Rcpp)
cppFunction(
"LogicalVector compareToRow(NumericMatrix x, NumericVector y) {
  const int nr = x.nrow();
  const int nc = x.ncol();
  LogicalVector ret(nr, true);
  for (int j=0; j < nr; ++j) {
    for (int k=0; k < nc; ++k) {
      if (x(j, k) != y[k]) {
        ret[j] = false;
        break;
      }
    }
  }
  return ret;
}")
testXts[compareToRow(testXts, rowToSearch),]
#                   a         b         c
# 1970-01-02 1.324457 0.8485654 -1.464764
# 1970-01-05 1.324457 0.8485654 -1.464764
# 1970-01-08 1.324457 0.8485654 -1.464764

Here's a comparison on a fairly large instance (with 1 million rows):

set.seed(144)
bigXts <- testXts[sample(nrow(testXts), 1000000, replace=TRUE),]
testDT <- as.data.frame(bigXts)

josilber <- function(x, y) x[compareToRow(x, y),]
roland.base <- function(x, y) x[colSums(t(x) != as.vector(y)) == 0L,]
library(data.table)
roland.dt <- function(testDT, y) {
  setDT(testDT, keep.rownames=TRUE)
  setkey(testDT, a, b, c)
  testDT[setDT(as.data.frame(y))]
}
library(microbenchmark)
microbenchmark(josilber(bigXts, rowToSearch), roland.base(bigXts, rowToSearch), roland.dt(testDT, rowToSearch), times=10)
# Unit: milliseconds
#                              expr         min         lq       mean     median         uq       max
#     josilber(bigXts, rowToSearch)    7.830986   10.24748   45.64805   14.41775   17.37049  258.4404
#  roland.base(bigXts, rowToSearch) 3530.042324 3964.72314 4288.05758 4179.64233 4534.21407 5400.5619
#    roland.dt(testDT, rowToSearch)   32.826285   34.95014  102.52362   57.30213  130.51053  267.2249

This benchmark assumes the object has been converted to a data frame (~4 seconds overhead) before calling the roland.dt and that compareToRows has been compiled (~3 seconds overhead) before calling josilber. The Rcpp solution is about 300x faster than the base R solution and about 4x faster than the data.table solution in median runtime. The approach based on digest was not competitive, taking more than 60 seconds to execute each time.

0
Roland On

Here is a faster base R solution:

ind <- colSums(t(testXts) != as.vector(rowToSearch)) == 0L
testXts[ind,]

Here is a solution using a data.table join:

library(data.table)
testDT <- as.data.frame(testXts)
setDT(testDT, keep.rownames=TRUE)
setkey(testDT, a, b, c)
testDT[setDT(as.data.frame(rowToSearch))]

However, I would be wary when comparing floating point numbers.

3
Rorschach On

This doesn't use data.table but could be quite fast. You could do this by hashing rows,

library(digest)
hash <- apply(testXts, 1, digest)
testXts[which(hash[1] == hash)]

#                    a          b          c
# 1970-01-02 0.8466816 -0.7129076 -0.5742323
# 1970-01-05 0.8466816 -0.7129076 -0.5742323
# 1970-01-08 0.8466816 -0.7129076 -0.5742323
0
C8H10N4O2 On

The simplest data.table solution is probably:

merge(as.data.table(testXts), as.data.table(rowToSearch, keep.rownames=FALSE))

Returns:

          a          b         c      index
1: 1.685138 -0.3039018 -1.550871 1970-01-02
2: 1.685138 -0.3039018 -1.550871 1970-01-05
3: 1.685138 -0.3039018 -1.550871 1970-01-08

Why this works:

merge = inner join on common columns if not otherwise specified. This inner join returns only columns with same values of (a, b, c) as rowToSearch.

keep.rownames=FALSE on right hand side ensures that the date index of rowToSearch (which is not wanted) is dropped and does not enter the common columns for joining.