R arrow query extremely slow first time, fast thereafter?


I'm going through this tutorial on how to use arrow instead of regular dplyr. The second and subsequent times I run a query on this dataset it's very fast, but the first run is incredibly slow (timings below).

Minimal Reproducible Example

Here's an MRE of what I'm doing. This first bit obtains the ~70 GB parquet dataset (takes ~6 hours, depending on internet connection).

library(tidyverse)
library(arrow)

# copy_files(
#   from = s3_bucket("ursa-labs-taxi-data-v2"),
#   to = "~/Datasets/nyc-taxi"
# )

Then, in preparation:

nyc_taxi <- open_dataset("~/Datasets/nyc-taxi")

# small file to join to later
nyc_taxi_zones <- read.csv(url("https://raw.githubusercontent.com/djnavarro/arrow-user2022/main/data/taxi_zone_lookup.csv")) %>% 
  janitor::clean_names()

airport_zones <- nyc_taxi_zones %>% 
  filter(str_detect(zone, "Airport")) %>% 
  pull(location_id)
# [1]   1 132 138


# Alter schema (otherwise joining int32 and int64 cols won't work) 
nyc_taxi_zones2 <- nyc_taxi_zones %>% 
  transmute(
    dropoff_location_id = location_id,
    dropoff_borough = borough,
    dropoff_zone = zone
  ) %>% 
  as_arrow_table(
    schema = schema(
      dropoff_location_id = int64(),
      dropoff_borough = utf8(),
      dropoff_zone = utf8()  
    )
  )

And finally, the actual operation:

start_time <- Sys.time()
nyc_taxi %>% 
  filter(
    pickup_location_id %in% airport_zones
  ) %>% 
  select(
    matches("datetime"),
    matches("location_id")
  ) %>% 
  left_join(
    nyc_taxi_zones2
  ) %>% 
  count(dropoff_zone) %>% 
  arrange(desc(n)) %>% 
  collect()

end_time <- Sys.time()
end_time - start_time
  • First run on a fresh EC2 instance: 17.7 minutes!
  • Second run in the same R session: 11 seconds!
  • After closing and reopening RStudio, it still takes 11 seconds.

What I've tried so far

  • Thinking the bottleneck might be I/O, I switched to an I/O-optimized EC2 instance, upgraded the SSD, and changed a few other things. None of it made any substantial difference.
  • This possibly related question/answer offers some promising leads. It's in Python, but the root cause there (not specifying partitioning) seems like a possible explanation for this problem.
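
To narrow this down, one option is to time the scan-and-filter step on its own, without the join, and compare a first run against a repeat run. The following is only a minimal sketch: `toy_dir` and the toy data frame are hypothetical stand-ins so the snippet runs without the 70 GB download, and the partition column `year` is an assumption about how the real dataset is laid out on disk. For the real data, the `open_dataset()` path would be `"~/Datasets/nyc-taxi"` instead.

```r
library(arrow)
library(dplyr)

# Hypothetical toy dataset standing in for the real NYC taxi data
toy_dir <- file.path(tempdir(), "toy-taxi")
toy <- data.frame(
  year = rep(c(2019L, 2020L), each = 3),
  pickup_location_id = c(1L, 132L, 7L, 138L, 9L, 132L)
)

# Write hive-style partitions (year=2019/, year=2020/)
write_dataset(toy, toy_dir, partitioning = "year")

ds <- open_dataset(toy_dir)

# Time only the scan + filter + count, with no join involved
scan_time <- system.time(
  airport_rows <- ds %>%
    filter(pickup_location_id %in% c(1L, 132L, 138L)) %>%
    count() %>%
    collect()
)

print(scan_time)
print(airport_rows)
```

If this step alone shows the same slow-then-fast pattern on the real dataset, that would suggest the cost lies in reading the files for the first time rather than in the join or the aggregation.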