Using the data.table package in R, I am trying to create a cartesian product of two data.tables using the merge method as one would do in base R.

In base the following works:

#assume this order data
orders<-data.frame(date=as.POSIXct(c('2012-08-28','2012-08-29','2012-09-01')),
               first.name=as.character(c('John','George','Henry')),
               last.name=as.character(c('Doe','Smith','Smith')),
               qty=c(10,50,6))

#and these dates
dates<-data.frame(date=seq(from=as.POSIXct('2012-08-28'),
                           to=as.POSIXct('2012-09-07'),by='day'))

#get the unique customers
cust<-unique(orders[,c('first.name','last.name')])

#using merge from base R, get the cartesian product
merge(dates,cust,by=integer(0))

However, the same technique does not work using data.table and this error is thrown:

"Error in merge.data.table(dates.dt, cust.dt, by = integer(0)) : A non-empty vector of column names for by is required."

#data.table approach
library(data.table)

orders.dt<-data.table(orders)

dates.dt<-data.table(dates)

cust.dt<-unique(orders.dt[,list(first.name,last.name)])

#try to use merge (data.table) in the same manner as base
merge(dates.dt,cust.dt,by=integer(0))

# Error in merge.data.table(dates.dt, cust.dt, by = integer(0)) : 
#   A non-empty vector of column names for `by` is required.

I want the result to reflect all customer names for all dates, just like in base, but do it in a data.table-centric way. Is this possible?

3 Answers

12
42- On Best Solutions

If you construct full names from the first and last in the dataframes, you can use CJ (cross-join). You cannot use all three vectors since there would be 99 items.

> nrow(CJ(dates$date, cust$first.name, cust$last.name ) )
[1] 99

This returns a data.table object:

> CJ(dates$date,paste(cust$first.name, cust$last.name) )
            V1           V2
 1: 2012-08-28 George Smith
 2: 2012-08-28  Henry Smith
 3: 2012-08-28     John Doe
 4: 2012-08-29 George Smith
 5: 2012-08-29  Henry Smith
 6: 2012-08-29     John Doe
 7: 2012-08-30 George Smith
 8: 2012-08-30  Henry Smith
 9: 2012-08-30     John Doe
10: 2012-08-31     John Doe
11: 2012-08-31 George Smith
12: 2012-08-31  Henry Smith
13: 2012-09-01     John Doe
14: 2012-09-01 George Smith
15: 2012-09-01  Henry Smith
16: 2012-09-02 George Smith
17: 2012-09-02  Henry Smith
18: 2012-09-02     John Doe
19: 2012-09-03  Henry Smith
20: 2012-09-03     John Doe
21: 2012-09-03 George Smith
22: 2012-09-04  Henry Smith
23: 2012-09-04     John Doe
24: 2012-09-04 George Smith
25: 2012-09-05 George Smith
26: 2012-09-05  Henry Smith
27: 2012-09-05     John Doe
28: 2012-09-06 George Smith
29: 2012-09-06  Henry Smith
30: 2012-09-06     John Doe
31: 2012-09-07 George Smith
32: 2012-09-07  Henry Smith
33: 2012-09-07     John Doe
            V1           V2
6
Josh O'Brien On

merge.data.table(x, y) is a convenience function that wraps a call to x[y], so the merge needs to be based on columns that are in both data.tables. (That's what that error message is trying to tell you).

One work-around is to add a dummy column to both data.tables, whose only purpose is to make the merge possible:

## Add a column "k", and append it to each data.table's vector of keyed columns.
setkeyv(cust.dt[,k:=1], c(key(cust.dt), "k"))
setkeyv(dates.dt[,k:=1], c(key(dates.dt), "k"))

## Merge and then remove the dummy column
res <- merge(dates.dt, cust.dt, by="k")
head(res[,k:=NULL])
#          date first.name last.name
# 1: 2012-08-28     George     Smith
# 2: 2012-08-28      Henry     Smith
# 3: 2012-08-28       John       Doe
# 4: 2012-08-29     George     Smith
# 5: 2012-08-29      Henry     Smith
# 6: 2012-08-29       John       Doe

## Maybe also clean up cust.dt and dates.dt    
# cust.dt[,k:=NULL]
# dates.dt[,k=NULL]
2
dnlbrky On

The solution from @JoshO'Brien uses merge but below is a similar alternative that does not (AFAIK).

If I understand the documentation in ?data.table::merge correctly, X[Y] should be slightly faster than data.table::merge(X,Y) (as of version 1.8.7). It refers to FAQ 2.12 to address this question, but the FAQ is a little confusing. First, the correct reference should be 1.12, not 2.12. And they don't indicate whether they are referring to the base version of merge or the data.table one, or both. So, this might just end up being a messier-looking solution that is equivalent, or it might be faster.

[Edit from Matthew] Thanks : now improved in v1.8.7 (?merge.data.table, FAQ 1.12 and added new FAQ 2.24)

DT_orders<-data.table(date=as.POSIXct(c('2012-08-28','2012-08-29','2012-08-29','2012-09-01')),
                      first.name=as.character(c('John','John','George','Henry')),
                      last.name=as.character(c('Doe','Doe','Smith','Smith')),
                      qty=c(10,2,50,6),
                      key="first.name,last.name")

# Note that I added a second record to the orders table for John Doe, to make sure it could handle duplicate first/last name combinations.

DT_dates<-data.table(date=seq(from=as.POSIXct('2012-08-28'),
                              to=as.POSIXct('2012-09-07'),by='day'),
                     key="date")

DT_custdates<-data.table(k=1,unique(DT_dates),key="k")[unique(DT_orders)[,list(k=1,first.name,last.name)]][,k:=NULL]