R arrow query extremely slow first time, fast thereafter?

Asked by stevec on 27 March 2024 at 10:25

I'm working through this tutorial on using arrow in place of regular dplyr. The second and subsequent runs of a query on this dataset are very fast, but the first run is incredibly slow (timings below).

Minimal Reproducible Example

Here's an MRE of what I'm doing. The first step downloads the ~70 GB Parquet dataset (takes ~6 hours, depending on your internet connection).

library(tidyverse)
library(arrow)

# copy_files(
#   from = s3_bucket("ursa-labs-taxi-data-v2"),
#   to = "~/Datasets/nyc-taxi"
# )

Then, in preparation:

nyc_taxi <- open_dataset("~/Datasets/nyc-taxi")

# small file to join to later
nyc_taxi_zones <- read.csv(url("https://raw.githubusercontent.com/djnavarro/arrow-user2022/main/data/taxi_zone_lookup.csv")) %>% 
  janitor::clean_names()

airport_zones <- nyc_taxi_zones %>% 
  filter(str_detect(zone, "Airport")) %>% 
  pull(location_id)
# [1]   1 132 138


# Alter schema (otherwise joining int32 and int64 cols won't work) 
nyc_taxi_zones2 <- nyc_taxi_zones %>% 
  transmute(
    dropoff_location_id = location_id,
    dropoff_borough = borough,
    dropoff_zone = zone
  ) %>% 
  as_arrow_table(
    schema = schema(
      dropoff_location_id = int64(),
      dropoff_borough = utf8(),
      dropoff_zone = utf8()  
    )
  )

And finally, the actual operation:

start_time <- Sys.time()
nyc_taxi %>% 
  filter(
    pickup_location_id %in% airport_zones
  ) %>% 
  select(
    matches("datetime"),
    matches("location_id")
  ) %>% 
  left_join(
    nyc_taxi_zones2,
    by = "dropoff_location_id"  # the only shared column; stated explicitly
  ) %>% 
  count(dropoff_zone) %>% 
  arrange(desc(n)) %>% 
  collect()

end_time <- Sys.time()
end_time - start_time

First run on a fresh EC2 instance: 17.7 minutes!
Second run in the same R session: 11 seconds!
Note that after closing and reopening RStudio, it still takes 11 seconds.

What I've tried so far

Thinking the bottleneck might be I/O, I switched to an I/O-optimized EC2 instance, upgraded the SSD, and tried a few other things. None of them made a substantial difference.
This possibly related question/answer has some promising leads. It's in Python, but the root cause there (not specifying partitioning) seems like a possible explanation for this problem too.
