dplyr::tbl() returning 0 rows from BigQuery external tables

65 views Asked by At

I'm using bigrquery and dplyr/dbplyr to connect and interact with tables on Google BigQuery. I've noticed that calling dplyr::tbl() or DBI::dbReadTable() on a managed table correctly returns a preview of the first 10 rows, but running the same on an external table returns a 0 row preview (albeit with the correct schema).

Is there a way to get external tables to work with dplyr::tbl() the way that managed tables do? Or is this a limitation of BQ external tables noting that you "cannot run a BigQuery job that exports data from an external table"?

# These return 0 rows
dplyr::tbl(bq_con, 'mtcars_ext')
#> Job complete
#> Billed: 0 B
#> # Source:   table<mtcars_ext> [0 x 11]
#> # Database: BigQueryConnection
#> # ℹ 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>, drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>

DBI::dbReadTable(bq_con, 'mtcars_ext')
#> # A tibble: 0 × 11
#> # ℹ 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>, drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>

# Verifying that this table has data in it
DBI::dbGetQuery(bq_con, str_glue('SELECT * FROM {project_id}.ds_edw.mtcars_ext'))
#> Job complete
#> Billed: 0 B
#> Downloading first chunk of data.
#> First chunk includes all requested rows.
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  2  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#>  [...]

I've tested this with the same results across multiple accounts, as well as bigrquery 1.4.2, 1.4.2.9; dbplyr 2.2.1, 2.3.0, 2.4.0; DBI 1.1.3.

It appears that the external table does return rows once compute() is run over it (at least in dbplyr 2.2.1). I need to be able to chain dplyr verbs on the tbl() object so I've opted to use a no-op function to force the table to compute. I suspect that both of these methods create a BQ temporary table which is the workaround that Google recommends for external tables in their documentation (linked above). However, this means that running the modified tbl() call has an additional cost attached to it, whereas the same call on managed tables is not billed.

# This works in dbplyr 2.2.1 but not 2.4.0
tbl(bq_con, str_glue('{project_id}.ds_edw.mtcars_ext')) %>% 
  mutate(dummy_col = 1) %>% 
  select(-dummy_col)
#> Job complete
#> Billed: 0 B
#> Job complete
#> Billed: 10.49 MB
#> Downloading first chunk of data.
#> First chunk includes all requested rows.
#> # Source:   SQL [?? x 11]
#> # Database: BigQueryConnection
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 2  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> [...]

# Compared to 0 cost with managed table
# NB: this result has not been cached in BQ prior

tbl(bq_con, 'mtcars_mgd')
#> Job complete
#> Billed: 0 B
#> # Source:   table<mtcars_mgd> [?? x 11]
#> # Database: BigQueryConnection
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  2  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#>  [...]
0

There are 0 answers