Fastest way of finding matching rows

Question

Fastest way of finding matching rows

1.7k views Asked by Samo At 20 June 2015 at 15:37

I am wondering what is the fastest way of finding all rows in xts object that are the same as one particular row

library(xts)

nRows <- 3

coreData <- data.frame(a=rnorm(nRows), b=rnorm(nRows), c=rnorm(nRows))

testXts1 <- xts(coreData, order.by=as.Date(1:nRows))
testXts2 <- xts(coreData, order.by=as.Date((nRows + 1):(2*nRows)))
testXts3 <- xts(coreData, order.by=as.Date((2*nRows + 1):(3*nRows)))

testXts <- rbind(testXts1, testXts2, testXts3)

> testXts
                    a         b         c
1970-01-02 -0.3288756  1.441799  1.321608
1970-01-03 -0.7105016  1.639239 -2.056861
1970-01-04  0.1138675 -1.782825 -1.081799
1970-01-05 -0.3288756  1.441799  1.321608
1970-01-06 -0.7105016  1.639239 -2.056861
1970-01-07  0.1138675 -1.782825 -1.081799
1970-01-08 -0.3288756  1.441799  1.321608
1970-01-09 -0.7105016  1.639239 -2.056861
1970-01-10  0.1138675 -1.782825 -1.081799

rowToSearch <- first(testXts)

> rowToSearch
                    a        b        c
1970-01-02 -0.3288756 1.441799 1.321608

indicesOfMatchingRows <- unlist(apply(testXts, 1, function(row)  lapply(1:NCOL(row), function(i) row[i] == coredata(rowToSearch[, i]))))

testXts[indicesOfMatchingRows, ]

                    a         b         c
1970-01-02 -0.3288756  1.441799  1.321608
1970-01-05 -0.3288756  1.441799  1.321608
1970-01-08 -0.3288756  1.441799  1.321608

I am sure this can be done in more elegant and fast way.

A more general question is how you say in R "I have this row matrix[5, ] how can I find (indexes of) other rows in matrix that are the same as matrix[5, ]".

How to do this in data.table?

Original Q&A

There are 4 answers

Roland On 20 June 2015 at 16:07

Here is a faster base R solution:

ind <- colSums(t(testXts) != as.vector(rowToSearch)) == 0L
testXts[ind,]

Here is a solution using a data.table join:

library(data.table)
testDT <- as.data.frame(testXts)
setDT(testDT, keep.rownames=TRUE)
setkey(testDT, a, b, c)
testDT[setDT(as.data.frame(rowToSearch))]

However, I would be wary when comparing floating point numbers.

Rorschach On 20 June 2015 at 16:17

This doesn't use data.table but could be quite fast. You could do this by hashing rows,

library(digest)
hash <- apply(testXts, 1, digest)
testXts[which(hash[1] == hash)]

#                    a          b          c
# 1970-01-02 0.8466816 -0.7129076 -0.5742323
# 1970-01-05 0.8466816 -0.7129076 -0.5742323
# 1970-01-08 0.8466816 -0.7129076 -0.5742323

C8H10N4O2 On 23 August 2017 at 03:53

The simplest data.table solution is probably:

merge(as.data.table(testXts), as.data.table(rowToSearch, keep.rownames=FALSE))

Returns:

          a          b         c      index
1: 1.685138 -0.3039018 -1.550871 1970-01-02
2: 1.685138 -0.3039018 -1.550871 1970-01-05
3: 1.685138 -0.3039018 -1.550871 1970-01-08

Why this works:

merge = inner join on common columns if not otherwise specified. This inner join returns only columns with same values of (a, b, c) as rowToSearch.

keep.rownames=FALSE on right hand side ensures that the date index of rowToSearch (which is not wanted) is dropped and does not enter the common columns for joining.

**josliber** · Accepted Answer · 2015-06-20T16:59:24+00:00

Since you said that speed is your main concern, you can get speedups even over a data.table solution with Rcpp:

library(Rcpp)
cppFunction(
"LogicalVector compareToRow(NumericMatrix x, NumericVector y) {
  const int nr = x.nrow();
  const int nc = x.ncol();
  LogicalVector ret(nr, true);
  for (int j=0; j < nr; ++j) {
    for (int k=0; k < nc; ++k) {
      if (x(j, k) != y[k]) {
        ret[j] = false;
        break;
      }
    }
  }
  return ret;
}")
testXts[compareToRow(testXts, rowToSearch),]
#                   a         b         c
# 1970-01-02 1.324457 0.8485654 -1.464764
# 1970-01-05 1.324457 0.8485654 -1.464764
# 1970-01-08 1.324457 0.8485654 -1.464764

Here's a comparison on a fairly large instance (with 1 million rows):

set.seed(144)
bigXts <- testXts[sample(nrow(testXts), 1000000, replace=TRUE),]
testDT <- as.data.frame(bigXts)

josilber <- function(x, y) x[compareToRow(x, y),]
roland.base <- function(x, y) x[colSums(t(x) != as.vector(y)) == 0L,]
library(data.table)
roland.dt <- function(testDT, y) {
  setDT(testDT, keep.rownames=TRUE)
  setkey(testDT, a, b, c)
  testDT[setDT(as.data.frame(y))]
}
library(microbenchmark)
microbenchmark(josilber(bigXts, rowToSearch), roland.base(bigXts, rowToSearch), roland.dt(testDT, rowToSearch), times=10)
# Unit: milliseconds
#                              expr         min         lq       mean     median         uq       max
#     josilber(bigXts, rowToSearch)    7.830986   10.24748   45.64805   14.41775   17.37049  258.4404
#  roland.base(bigXts, rowToSearch) 3530.042324 3964.72314 4288.05758 4179.64233 4534.21407 5400.5619
#    roland.dt(testDT, rowToSearch)   32.826285   34.95014  102.52362   57.30213  130.51053  267.2249

This benchmark assumes the object has been converted to a data frame (~4 seconds overhead) before calling the roland.dt and that compareToRows has been compiled (~3 seconds overhead) before calling josilber. The Rcpp solution is about 300x faster than the base R solution and about 4x faster than the data.table solution in median runtime. The approach based on digest was not competitive, taking more than 60 seconds to execute each time.

TechQA.

Fastest way of finding matching rows

There are 4 answers

Related Questions in R

Related Questions in DATA.TABLE

Related Questions in XTS

Popular Questions

Popular Tags

Trending Questions