Why is vroom so slow?


I have a simple operation where I read several CSVs, bind them, and then export the result, but vroom is much slower than the other methods. I must be doing something wrong, but I'm not sure what or why.

library(readr)
library(vroom)
library(data.table)
library(microbenchmark)

write_csv(mtcars, "test.csv")

microbenchmark(
  readr={
    t <- read_csv("test.csv", col_types=cols())
    write_csv(t, "test.csv")
  },data.table={
    t <- fread("test.csv")
    fwrite(t, "test.csv", sep=",")
  },vroom={
    t <- vroom("test.csv", delim=",", show_col_types = FALSE)
    vroom_write(t, "test.csv", delim=",")
  },
  times=10
)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq        max neval
#>       readr 12.636961 12.662955 15.865400 12.928211 13.503029  41.104583    10
#>  data.table  2.200815  2.275252  2.633456  2.342797  2.529283   4.830134    10
#>       vroom 57.376353 57.915135 64.280365 58.496847 58.966311 117.150837    10

Created on 2021-07-01 by the reprex package (v2.0.0)

Answer by jmcastagnetto (best answer):

To test with more data, I used the CSV from https://www.datosabiertos.gob.pe/dataset/vacunaci%C3%B3n-contra-covid-19-ministerio-de-salud-minsa, which contains more than 7.3 million rows, with a slight variation of your code:

library(readr)
library(vroom)
library(data.table)
library(microbenchmark)
csv_file <- "vacunas_covid.csv.gz"
microbenchmark(
   readr={
     t <- read_csv(csv_file, col_types=cols())
     write_csv(t, csv_file)
   },data.table={
     t <- fread(csv_file)
     fwrite(t, csv_file, sep=",")
   },vroom={
     t <- vroom(csv_file, delim=",", show_col_types = FALSE)
     vroom_write(t, csv_file, delim=",")
   },
   times=5
)

The results were:

Unit: seconds
       expr       min        lq      mean    median        uq       max neval  cld
      readr 101.72094 105.75384 109.16869 106.08111 108.06967 124.21788     5    c
 data.table  28.18751  30.32570  31.06592  30.44838  33.12746  33.24055     5  a
      vroom  48.65399  51.52445  55.78264  52.89823  53.83582  72.00071     5   b

From these results, on a big dataset vroom is roughly 2x faster than readr, and data.table is ~1.7x faster than vroom. Perhaps the issue with the original example is that the data is tiny (mtcars has only 32 rows), so the fixed cost of the indexing that vroom performs dominates the timing.
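One way to check that guess is to time the read step alone at several file sizes and see whether vroom's gap shrinks as the file grows. The following is only a rough sketch, not something I ran for the numbers above: bench_read is a hypothetical helper, the row counts are arbitrary, and the test data is just mtcars recycled to n rows.

```r
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)

# Hypothetical helper (not from any package): median read time, in
# milliseconds, of each reader on a temporary CSV with n rows.
bench_read <- function(n, times = 5) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  # Recycle the 32 rows of mtcars to build a data frame with n rows.
  df <- mtcars[rep_len(seq_len(nrow(mtcars)), n), ]
  fwrite(df, path)
  res <- microbenchmark(
    readr      = read_csv(path, col_types = cols()),
    data.table = fread(path),
    vroom      = vroom(path, delim = ",", show_col_types = FALSE),
    times = times
  )
  # microbenchmark stores raw timings in nanoseconds.
  tapply(res$time, res$expr, median) / 1e6
}

# Arbitrary sizes: the original 32 rows, then two larger files.
sizes <- c(32, 1e3, 1e5)
medians <- sapply(sizes, bench_read)
colnames(medians) <- paste0("n=", sizes)
medians
```

Each column of medians is one file size and each row is one reader. If a fixed per-call cost is the explanation, vroom's times should grow much more slowly than the row count at the small sizes.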

In case it is useful, the code and results are at: https://gist.github.com/jmcastagnetto/fef3f3a2778028e7efb6836d6d8e3f8e