I'm working on the data for a clinical trial. In this data, there are multiple observations that refer to the same patient and they have varying degrees of missing data. I have four variables which I'm using as identifiers and I've decided that, if a patient has identical values for three out of the four variables, that he/she should be considered a duplicate.
The degree of missing data in all of these variables is extremely low, so there's not an issue of a patient having three NA's and being considered a duplicate of another one with three NA's.
I've been able to remove the duplicates by creating the combinations of variables and using the dplyr::distinct() . Unfortunately, this leads to one of the observations being kept, which might have missing data that an observation for the same patient had, but was excluded. I'll demonstrate with some fake data:
library(tidyverse)
library(knitr)
tibble(Name = c("Pedro", "Pedro","Pedro","Cristina","Walter","Sara","Mateus"),
Birth = dmy(c("29/07/1994","29/07/1994","29/07/1994","01/04/1960", "22/12/1956", "20/02/1997","25/07/1994")),
CNS = c("700",NA,"700","701","702","703","704"),
Document = c("104","104",NA,"105","106","107","108"),
SystolicBP1 = c(NA,NA,120,160,152,114,NA),
DiastolicBP1 = c(NA,NA,80,100,100,92,NA),
SystolicBP2 = c(NA,NA,NA,148,NA,NA,100),
DiastolicBP2= c(NA,NA,NA,90,NA,NA,82),
HBA1c = c(7,7,7,8.2,8,9,6.5)) -> dt
kable(dt)
| Name | Birth | CNS | Document | SystolicBP1 | DiastolicBP1 | SystolicBP2 | DiastolicBP2 | HBA1c |
|---|---|---|---|---|---|---|---|---|
| Pedro | 1994-07-29 | 700 | 104 | NA | NA | NA | NA | 7.0 |
| Pedro | 1994-07-29 | NA | 104 | NA | NA | NA | NA | 7.0 |
| Pedro | 1994-07-29 | 700 | NA | 120 | 80 | NA | NA | 7.0 |
| Cristina | 1960-04-01 | 701 | 105 | 160 | 100 | 148 | 90 | 8.2 |
| Walter | 1956-12-22 | 702 | 106 | 152 | 100 | NA | NA | 8.0 |
| Sara | 1997-02-20 | 703 | 107 | 114 | 92 | NA | NA | 9.0 |
| Mateus | 1994-07-25 | 704 | 108 | NA | NA | 100 | 82 | 6.5 |
I can then run the following to get rid of the duplicates, but I end up losing some data:
identifiers <- c("Name","Birth","CNS","Document")
combn(identifiers, 3) -> comb
kable(comb)
Name Name Name Birth Birth Birth CNS CNS CNS Document Document Document
## this takes the result of distinct() for each
## combination of the 4 identifiers, taken 3 at a time
dt %>%
distinct (!!! syms(comb[,1]), .keep_all = TRUE) %>%
distinct (!!! syms(comb[,2]), .keep_all = TRUE) %>%
distinct (!!! syms(comb[,3]), .keep_all = TRUE) %>%
distinct (!!! syms(comb[,4]), .keep_all = TRUE) %>%
kable()
Name Birth CNS Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c Pedro 1994-07-29 700 104 NA NA NA NA 7.0 Cristina 1960-04-01 701 105 160 100 148 90 8.2 Walter 1956-12-22 702 106 152 100 NA NA 8.0 Sara 1997-02-20 703 107 114 92 NA NA 9.0 Mateus 1994-07-25 704 108 NA NA 100 82 6.5
My goal is, when faced with missing data, to try to get the missing data from the duplicate observations. If I could do this, in the example above the first observation would've gotten the values for "SystolicBP1" and "DiastolicBP1" from the third observation in the original tibble.
We could create strings out of the first four ident-columns and run
hclustbased on string distances fromadistto getids. Next, usingby, we split the data along theseids and reduce to one row, leaving either first value orNA.Ordep is considered as a duplicate of Pedro, and Laura is considered as a duplicate of Ana, which is fine according to your statements. As tuning parameter in
cl_fun()I usehfromcutree(), which specifies the height if you plothc, you could also trykinstead, which would specify the number of clusters with the disadvantage, though, thatkis an integer and less flexible. You could also specify more arguments ofadist()such ascoststo fine-tune. In thish=8.0produced the wanted result.Data:
Looks like: