R readr read_csv skip error with VROOM_CONNECTION_SIZE


I have a large (~18 GB) CSV file that I would like to read in chunks. Each chunk is processed (filtered) separately, and the results are concatenated. Because I iterate through several chunks, I use the skip parameter of the read_csv function.

Here is an example of a chunk:

library(readr)

# Size of each chunk to be read and processed
chunk_size <- 2000000
# In the full code, row_skip increases by chunk_size on every loop iteration.
row_skip <- 2000000 # initial value
col_names <- c("..", "..")
chunk <- read_csv("G:/../data.csv", skip = row_skip,
                  n_max = chunk_size, col_names = col_names)
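
For context, the loop described in the comment above can be sketched roughly as follows. This is only a minimal illustration, not the full code: the temporary file, the column names id and value, and the filter value > 10 are hypothetical stand-ins for the real data and processing.

```r
library(readr)

# Hypothetical small file standing in for the real 18 GB CSV
path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:10, value = (1:10) * 2), path)

chunk_size <- 4
col_names <- c("id", "value")  # assumed column names
row_skip <- 1                  # start after the header line
kept <- list()

repeat {
  chunk <- read_csv(path, skip = row_skip, n_max = chunk_size,
                    col_names = col_names, col_types = "id")
  # Filter the chunk and keep only the rows of interest
  kept[[length(kept) + 1]] <- chunk[chunk$value > 10, ]
  # A short chunk means the end of the file was reached
  if (nrow(chunk) < chunk_size) break
  # Advance skip by one chunk for the next iteration
  row_skip <- row_skip + chunk_size
}

# Concatenate the filtered chunks
result <- do.call(rbind, kept)
```

With this pattern skip grows by chunk_size on every pass, which is why it eventually reaches the values that trigger the error below.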

My problem is that if the skip parameter is sufficiently large (roughly 500,000 or more), I get the following error:

Error: The size of the connection buffer (131072) was not large enough to fit a complete line: Increase it by setting Sys.setenv("VROOM_CONNECTION_SIZE")

I have already tried increasing it with Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 10000), but the problem persists. Because skip has to grow as I iterate through the chunks, it will eventually reach the threshold that triggers the error, regardless of the initial chunk size.
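
To rule out a formatting issue in how the variable is stored, here is a minimal sketch of setting it as an explicit string and checking the stored value. It assumes the variable has to be set in the same R session before read_csv is called; the name buffer_bytes is only illustrative.

```r
# Environment variables are stored as strings, so format the number
# explicitly to avoid scientific notation (e.g. "1e+05")
buffer_bytes <- 131072 * 10000
Sys.setenv(VROOM_CONNECTION_SIZE = format(buffer_bytes, scientific = FALSE))

# Check what was actually stored
Sys.getenv("VROOM_CONNECTION_SIZE")
```

This only confirms that the value is in place; it does not by itself show whether the larger buffer is enough for the lines being read.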

I thought it could be caused by insufficient RAM, but I can read an initial chunk that is much larger than the skip value that triggers the error (e.g. chunk_size = 5000000 works, while row_skip = 2000000 already fails).
