inner_join() NA values on sqlite tibbles

149 views Asked by At

I've encountered a problem with doing a natural join on sqlite tibbles (tbl_sql objects) containing same-named columns that have NA values (or missing values, I suppose) in them.

library(DBI)
library(dplyr)
library(dbplyr)

## modify mtcars for example
modcars <- mtcars
modcars[["NAs"]] <- c(rep(1, 3), rep(NA, 29))

## store modcars in sql table and get it
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "modcars", modcars)
srcdbi_mydb <- src_dbi(mydb)
tbl_modcars <- tbl(srcdbi_mydb, "modcars")

modcars %>% head
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb NAs
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   1
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   1
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  NA
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  NA
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  NA

tbl_modcars %>% head
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1
#> 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1    NA
#> 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2    NA
#> 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1    NA

Notice that there is a difference in the output of inner joining these two tables with themselves. This is due to a difference on how dplyr and sqlite treat missing values.

inner_join(modcars, modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb NAs
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   1
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   1
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  NA
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  NA
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  NA

inner_join(tbl_modcars, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1

Effectively, I want the same rows when inner_join() is invoked on modcars data.frames to appear when inner_join() is invoked on the tbl_modcars tibbles.

I realize that I could simply use the following code to obtain the desired output:

joinee1 <- tbl_modcars %>% select(setdiff(colnames(tbl_modcars), "NAs"))
inner_join(joinee1, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1
#> 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1    NA
#> 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2    NA
#> 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1    NA

However, this ignores the join on any non-NA information in the NAs column (if that were applicable). Also, I'd rather perform just one dplyr invocation than two (parser stack overflow may become an issue if too many are invoked).

Any solution or clarification is appreciated.

0

There are 0 answers