So far I have used bigrquery's query_exec to download time series data from BigQuery.
sql <- "SELECT Date, val1, val2
FROM `mydata`
WHERE DATE(_PARTITIONTIME) BETWEEN '2020-05-01' AND '2020-06-01'"
project <- "myproj"
df <- query_exec(sql, project = project, max_pages = Inf, use_legacy_sql = FALSE) %>% as_tibble()
Since the last update, a warning appears indicating that query_exec is deprecated and that bq_table_download in conjunction with bq_project_query should be used instead.
tb <- bq_project_query(project, sql)
df <- bq_table_download(tb, page_size = 100000)
Adjusting my code resulted in a data frame of the same size (more than 4 million rows) as the query_exec request. However, from roughly row 80,000 onwards only the date 1970-01-01 appears, and the remaining columns are either empty or contain zeros. The old query_exec approach still works and returns a correctly parsed data frame.
Any ideas what could be the problem here?
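For illustration, a quick check like the one below (a sketch assuming the Date column from the query above is parsed as class Date) shows the symptom: past a certain row, the dates collapse to the Unix epoch.

library(dplyr)

# Count rows whose date has collapsed to the Unix epoch (1970-01-01);
# in my case this affects everything beyond roughly row 80,000.
df %>%
  filter(Date == as.Date("1970-01-01")) %>%
  count()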
This is most likely related to the page_size parameter, which you set to 100000. If it is increased to larger values, the results are no longer parsed correctly and NAs or wrongly parsed values appear. I assume your dates come out as 1970-01-01 because of that.
Try setting page_size to something closer to the default of 10000 and it should work. I have not yet found the optimal value, but 20000 works fine for me.
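A minimal sketch of the workaround, reusing the sql and project objects from your question; only page_size changes:

library(bigrquery)

tb <- bq_project_query(project, sql)
# Smaller pages avoid the parsing problem; 20000 works for me,
# but you may need to experiment to find the best value.
df <- bq_table_download(tb, page_size = 20000)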