I have a simple operation where I read several CSVs, bind them, and then export, but vroom
is performing much more slowly than the other methods. I must be doing something wrong, but I'm not sure what or why.
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)
write_csv(mtcars, "test.csv")
microbenchmark(
  readr = {
    t <- read_csv("test.csv", col_types = cols())
    write_csv(t, "test.csv")
  },
  data.tabl = {
    t <- fread("test.csv")
    fwrite(t, "test.csv", sep = ",")
  },
  vroom = {
    t <- vroom("test.csv", delim = ",", show_col_types = FALSE)
    vroom_write(t, "test.csv", delim = ",")
  },
  times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> readr 12.636961 12.662955 15.865400 12.928211 13.503029 41.104583 10
#> data.tabl 2.200815 2.275252 2.633456 2.342797 2.529283 4.830134 10
#> vroom 57.376353 57.915135 64.280365 58.496847 58.966311 117.150837 10
Created on 2021-07-01 by the reprex package (v2.0.0)
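As an aside, since the real task is reading several CSVs and binding them, note that vroom can read and row-bind multiple files in a single call, which avoids a separate bind step. A minimal sketch (the file names here are hypothetical, not from the original post):

```r
library(vroom)

# Hypothetical input files; vroom() accepts a vector of paths and
# row-binds them, optionally recording the source file via `id`.
files <- c("part1.csv", "part2.csv")
combined <- vroom(files, delim = ",", id = "source_file", show_col_types = FALSE)
vroom_write(combined, "combined.csv", delim = ",")
```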
To test with more data, I used the CSV from https://www.datosabiertos.gob.pe/dataset/vacunaci%C3%B3n-contra-covid-19-ministerio-de-salud-minsa, which contains 7.3+ million rows, and used a slight variation of your code:
The results were:
From the results, vroom is at least 2x faster than readr on a big dataset, and data.table is ~1.7x faster than vroom. Perhaps the issue with the original example is that the data is small, and the indexing that vroom performs is contributing to the difference. Just in case, the code and results are at: https://gist.github.com/jmcastagnetto/fef3f3a2778028e7efb6836d6d8e3f8e
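To expand on the indexing point: vroom's speed on large files comes from lazy, ALTREP-backed reading, where columns are only parsed when first accessed. Writing the whole tibble back out immediately forces every column to materialize, so on a tiny file the up-front indexing overhead dominates. A sketch of disabling this, assuming the `altrep` argument (available in vroom >= 1.1; in older versions it was `altrep_opts`):

```r
library(vroom)

# altrep = FALSE parses all columns eagerly instead of lazily,
# trading lazy-read overhead for a plain up-front parse; on small
# files this may narrow the gap seen in the benchmark above.
t <- vroom("test.csv", delim = ",", altrep = FALSE, show_col_types = FALSE)
vroom_write(t, "test.csv", delim = ",")
```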