Why is my merge returning more rows than I expected?

55 views Asked by At

First, I'm very sorry for posting such a question. However, I'm a little bit confused about getting unexpected output. I want to merge two data frames.

x <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                vr1=c(1,2,3,4,5,6,7,8,9))
y <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                vr2=c(1,2,3,4,5,6,7,8,9))
z <- merge(x, y, by="id")

I'm expecting the output would be the following:

  id vr1 vr2
1  1   1   1
2  1   2   2
3  1   3   3
4  2   4   4
5  2   5   5
6  2   6   6
7  3   7   7
8  3   8   8
9  3   9   9

However, I got the following. What is my mistake?

   id vr1 vr2
1   1   1   1
2   1   1   2
3   1   1   3
4   1   2   1
5   1   2   2
6   1   2   3
7   1   3   1
8   1   3   2
9   1   3   3
10  2   4   4
11  2   4   5
12  2   4   6
13  2   5   4
14  2   5   5
15  2   5   6
16  2   6   4
17  2   6   5
18  2   6   6
19  3   7   7
20  3   7   8
21  3   7   9
22  3   8   7
23  3   8   8
24  3   8   9
25  3   9   7
26  3   9   8
27  3   9   9
3

There are 3 answers

2
G. Grothendieck On BEST ANSWER

Each id of 1 in x matches each id 1 in y so the merge generates 9 rows for id 1.

If there are the same number of rows as in the question and the id values correspond as in the question the cbind(x, y) will give the desired result shown.

cbind(x, y)
##   id vr1 id vr2
## 1  1   1  1   1
## 2  1   2  1   2
## 3  1   3  1   3
## 4  2   4  2   4
## 5  2   5  2   5
## 6  2   6  2   6
## 7  3   7  3   7
## 8  3   8  3   8
## 9  3   9  3   9

Another possibility is that you want to match the ith id 1 in x to just the ith id 1 in y. (For other examples you may need to use the all= argument of merge but for this example all values of that argument give the same result.) In this case

x2 <- transform(x, seq = ave(id, id, FUN = seq))
y2 <- transform(y, seq = ave(id, id, FUN = seq))

merge(x2, y2, by = c("id", "seq"))
##   id seq vr1 vr2
## 1  1   1   1   1
## 2  1   2   2   2
## 3  1   3   3   3
## 4  2   1   4   4
## 5  2   2   5   5
## 6  2   3   6   6
## 7  3   1   7   7
## 8  3   2   8   8
## 9  3   3   9   9
1
proffarnsworth On

I think there is a little ambiguity in the question as it is posed. Do you only want to bind columns if the ID is the same and the values are different? Like if you had

x <- data.frame(id=c(1,1,1),
                vr1=c(1,1,1))
y <- data.frame(id=c(1,1,1),
                vr2=c(1,1,1)

Should it output

id vr1 vr2
1   1   1
1   1   1
1   1   1

or

id vr1 vr2 
1   1   1

My solution is to just rename the column in y and then use cbind. Rename it back after.

colnames(y)[2] <- "vr1"
df <- cbind(x, y[,"vr1", drop = FALSE])
colnames(df)[3] <- "vr2"
df

> df
  id vr1 vr2
1  1   1   1
2  1   2   2
3  1   3   3
4  2   4   4
5  2   5   5
6  2   6   6
7  3   7   7
8  3   8   8
9  3   9   9
0
Elin On

There are a few issues going on.

  1. By default merge() only returns those rows that are in both. In this case you only have 1,2,3 as ID in the two data frames so only rows with those values are returned.

  2. Each x is matched with a y with the same ID. So since you have e.g. 3 ID=1 rows in x that will be matched with the one row that has ID=1 in the y. (x is the left data frame, y is the right).