Outer join data.tables, avoiding duplicate columns


I have two tables like the ones below. Most ids are shared, but each table contains ids that aren't in the other. The year columns also overlap, but each table has some years the other doesn't:

library(data.table)

table1 <- data.table(id=c("a101","a102","a103","a104","a105"),
                     year1=c(0,10,12,414,23),
                     year2=c(4,23,34,45,23))

table2 <- data.table(id=c("a102","a103","a104","a105","a106"),
                     year2=c(23,34,45,23,0),
                     year3=c(14,2,14,12,15))

I want to outer join them so I end up with a table like this:

table3 <- data.table(id=c("a101","a102","a103","a104","a105","a106"),
                     year1=c(0,10,12,414,23,NA),
                     year2=c(4,23,34,45,23,0),
                     year3=c(NA,14,2,14,12,15))

This sounds simple, but my real tables have many year columns, and I can't find a solution that avoids the duplicate suffixed columns (e.g. year2.x, year2.y, year3.x, year3.y) without writing out every column name in the code.

My current solution is below, but it seems convoluted and probably not very memory efficient. A solution that joins by reference would be ideal, as the tables I'm working with are very large, though I realise that might not be possible for an outer join.

# Full join the tables
table3 <- merge(table1, table2, by = "id", all = TRUE)

# Coalesce duplicate columns
for (col in intersect(names(table1), names(table2))) {
  if (col != "id") {
    table3[[col]] <- fcoalesce(table3[[paste0(col, ".x")]], table3[[paste0(col, ".y")]])
  }
}

# Remove duplicate columns
table3 <- table3[, !grepl("\\.x$|\\.y$", names(table3)), with = FALSE]
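
One alternative I've been wondering about, since update joins assign by reference, is to build the full set of ids first and then pull the year columns in from each table in a loop. This is only a sketch based on the toy tables above, and I'm not sure it covers every case my fcoalesce approach does:

# Sketch: union of ids first, then fill the year columns from each table
# with update joins (assignment by reference). Joining table2 first and
# table1 second means table1's values win for cells present in both,
# like fcoalesce(.x, .y) above.
table3 <- data.table(id = sort(unique(c(table1$id, table2$id))))

for (tbl in list(table2, table1)) {
  cols <- setdiff(names(tbl), "id")
  table3[tbl, (cols) := mget(paste0("i.", cols)), on = "id"]
}

# Columns come out in join order, so reorder if that matters
setcolorder(table3, c("id", sort(setdiff(names(table3), "id"))))

The caveat is that the later table simply overwrites shared cells rather than coalescing, so a genuine NA in table1 would also replace a value already taken from table2, and the join order has to encode which table takes precedence.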