Discrepancy between "query_exec" and "bq_table_download" using bigrquery

363 views Asked by At

So far I used bigrquery's query_exec to download timeseries data from BigQuery.

sql <- "SELECT  Date, val1, val2
        FROM `mydata` 
        WHERE DATE(_PARTITIONTIME) BETWEEN '2020-05-01' AND '2020-06-01'"

project <- "myproj"

df <- query_exec(sql, project = project, max_pages = Inf, use_legacy_sql = FALSE) %>% as_tibble()

Since the last update a warning appears indicating that query_exec is deprecated and instead bq_table_download in conjuction with bq_project_query should be used.

tb <- bq_project_query(project, sql)
df <- bq_table_download(tb, page_size = 100000)

Adjusting my code resulted in a dataframe of the same size (more than 4 million rows) as doing the request with query_exec. However, from row ~80000 onwards now only dates of format 1970-01-01 appear and the remaining columns are either empty or contain zeros. Using the old way with query_exec still works and results in the correctly formatted dataframe.

Any ideas what could be the problem here?

1

There are 1 answers

2
arohland On

This is most likely related to the page_size parameter that you set to 100000. If this is increased to bigger numbers the results are not properly parsed anymore and some NAs or wrongly parsed errors appear. I assume that you dates come out as 1970-01-01 because of that.

Try setting the page_size to something closer to the default 10000 and it should work. I have not yet found the perfect value, but 20000 works fine for me.