I need to join two data.frame
s in R
, and both of them have two id candidates, but lots of missing data (so I can't just pick one and filter out the rest).
EDIT reproducible example of both datasets, and desired outcome:
messy1 <- data.frame(id1 = c(11, NA, 13, 11, 12, NA), id2 = c(NA, 22, 23, 21, NA, 23), time = rep(1:2, each = 3))
messy2 <- data.frame(id1 = c(12, NA, 14, 14), id2 = c(22, 23, 24, NA), time = c(1, 1, 1, 2))
desired1 <- data.frame(id1 = rep(11:13, 2), id2 = rep(21:23, 2), time = rep(1:2, each = 3)) # 12/22 is only available in messy2
desired2 <- data.frame(id1 = c(12:13, 14, 14), id2 = c(22:23, 24, 24), time = c(rep(1, 3), 2)) # 13/23 is only available in messy1
desired_joined <- full_join(desired1, desired2)
> messy1
id1 id2 time
1 11 NA 1
2 NA 22 1
3 13 23 1
4 11 21 2
5 12 NA 2
6 NA 23 2
> messy2
id1 id2 time
1 12 22 1
2 NA 23 1
3 14 24 1
4 14 NA 2
> desired_joined
id1 id2 time
1 11 21 1
2 12 22 1
3 13 23 1
4 11 21 2
5 12 22 2
6 13 23 2
7 14 24 1
8 14 24 2
FINAL EDIT
the example above now accurately reflects the problem I was trying to solve. After figuring this out, the comment provided by @Ben is the actual solution.
Obsolete chunks of the question have been removed to avoind further confusing anyone that bumps into this in the future.
As I pointed out in my latest edit to the question, the solution is to do a
fill()
. What I found, in the actual RWD is that it's better to do a double-fill withid1
andid2
, in each of the datasets before joining. Anyone who needs to do something similar may use the following steps:Note: My actual datasets have ~ 65k rows each, but if I
join
before thefill
I end up with 2.5M rows (the reason eludes me). By doing all thesefill
s I end up with a much more reasonable 73k rows.