I have a larger than memory arrow dataset created by open_dataset() from partitioned parquet files that I need to use distinct(.keep_all = TRUE) on. I need to keep the computation on disk, therefore I'm using arrow to speed things up and not crash my R session.
I want to keep rows that have distinct values in columns a and b.
Example
df <- tibble(a = c(1,1,2,2),
b = c(1,1,2,1),
c = c("x", "y", "z", "a")) %>%
arrow_table()
df %>%
distinct(a, b, .keep_all = TRUE)
This results in:
Error: distinct() with .keep_all = TRUE not supported in Arrow
Desired Output
An arrow dataset with the following values.
a b c
<dbl> <dbl> <chr>
1 1 1 x
2 2 2 z
3 2 1 a
I see others have had similar questions but it doesn't seem like arrow plans to incorporate use of .keep_all (see closed issue).
base::duplicated() would also work but it's not supported by arrow either. Any thoughts on how to work around this without using collect() (which crashes my R session)? TIA!
As commented, a grouped
sliceoperation will likely do instead.Created on 2024-03-28 with reprex v2.1.0