I've encountered a problem with doing a natural join on sqlite tibbles (tbl_sql
objects) containing same-named columns that have NA values (or missing values, I suppose) in them.
library(DBI)
library(dplyr)
library(dbplyr)
## modify mtcars for example
modcars <- mtcars
modcars[["NAs"]] <- c(rep(1, 3), rep(NA, 29))
## store modcars in sql table and get it
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "modcars", modcars)
srcdbi_mydb <- src_dbi(mydb)
tbl_modcars <- tbl(srcdbi_mydb, "modcars")
modcars %>% head
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
tbl_modcars %>% head
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
Notice that there is a difference in the output of inner joining these two tables with themselves. This is due to a difference on how dplyr and sqlite treat missing values.
inner_join(modcars, modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
inner_join(tbl_modcars, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
Effectively, I want the same rows when inner_join()
is invoked on modcars
data.frames
to appear when inner_join()
is invoked on the tbl_modcars
tibbles.
I realize that I could simply use the following code to obtain the desired output:
joinee1 <- tbl_modcars %>% select(setdiff(colnames(tbl_modcars), "NAs"))
inner_join(joinee1, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
However, this ignores the join on any non-NA information in the NAs
column (if that were applicable). Also, I'd rather perform just one dplyr invocation than two (parser stack overflow may become an issue if too many are invoked).
Any solution or clarification is appreciated.