Porting set operations from R's data frames to data tables: How to identify duplicated rows?

998 views Asked by At

[Update 1: As Matthew Dowle noted, I'm using data.table version 1.6.7 on R-Forge, not CRAN. You won't see the same behavior with an earlier version of data.table.]

As background: I am porting some little utility functions to do set operations on rows of a data frame or pairs of data frames (i.e. each row is an element in a set), e.g. unique - to create a set from a list, union, intersection, set difference, etc. These mimic Matlab's intersect(...,'rows'), setdiff(...,'rows'), etc., which don't appear to have counterparts in R (R's set operations are limited to vectors and lists, but not rows of matrices or data frames). Examples of these little functions are below. If this functionality for data frames already exists in some package or base R, I'm open to suggestions.

I have been migrating these to data tables and one necessary step in the current approach is to find duplicated rows. When duplicated() is executed an error is returned stating that data tables must have keys. This is an unfortunate roadblock - other than setting keys, which isn't a universal solution and adds to computational costs, is there some other way to find duplicated objects?

Here is a reproducible example:

library(data.table)
set.seed(0)
x   <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))
y   <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))

res3    <- dt_intersect(x,y)

Yielding this error message:

Error in duplicated.data.table(z_rbind) : data table must have keys

The code works as-is for data frames, though I've named each function with the pattern dt_operation.

Is there some way to get around this issue? Setting keys only works for integers, which is a constraint I can't assume for the input data. So, perhaps I'm missing a clever way to use data tables?


Example set operation functions, where the elements of the sets are rows of data:

dt_unique       <- function(x){
    return(unique(x))
}

dt_union        <- function(x,y){
    z_rbind     <- rbind(x,y)
    z_unique    <- dt_unique(z_rbind)
    return(z_unique)
}

dt_intersect    <- function(x,y){
    zx          <- dt_unique(x)
    zy          <- dt_unique(y)

    z_rbind     <- rbind(zy,zx)
    ixDupe      <- which(duplicated(z_rbind))
    z           <- z_rbind[ixDupe,]
    return(z)
}

dt_setdiff      <- function(x,y){
    zx          <- dt_unique(x)
    zy          <- dt_unique(y)

    z_rbind     <- rbind(zy,zx)
    ixRangeX    <- (nrow(zy) + 1):nrow(z_rbind)
    ixNotDupe   <- which(!duplicated(z_rbind))
    ixDiff      <- intersect(ixNotDupe, ixRangeX)
    diffX       <- z_rbind[ixDiff,]
    return(diffX)
}

Note 1: One intended use for these helper functions is to find rows where key values in x are not among the key values in y. This way, I can find where NAs may appear when calculating x[y] or y[x]. Although this usage allows for setting of keys for the z_rbind object, I'd prefer not to constrain myself to just this use case.

Note 2: For related posts, here is a post on running unique on data frames, with excellent results for running it with the updated data.table package. And this is an earlier post on running unique on data tables.

1

There are 1 answers

3
Matt Dowle On BEST ANSWER

duplicated.data.table needs the same fix unique.data.table got [EDIT: Now done in v1.7.2]. Please raise another bug report: bug.report(package="data.table"). For the benefit of others watching, you're already using v1.6.7 from R-Forge, not 1.6.6 on CRAN.

But, on Note 1, there's a 'not join' idiom :

x[-x[y,which=TRUE]]

See also FR#1384 (New 'not' and 'whichna' arguments?) to make that easier for users, and that links to the keys that don't match thread which goes into more detail.


Update. Now in v1.8.3, not-join has been implemented.

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...]                            # same result, now preferred.