I have a large (~18 GB) CSV file that I would like to read in chunks. The chunks are processed separately (filtered) and then concatenated. Since I'm iterating through several chunks, I'm using the skip parameter of the read_csv function.
Here is an example of a chunk:
library(readr)

# Size of each chunk to be read and processed
chunk_size <- 2000000
# In the full code, row_skip increases by the chunk size in each loop iteration.
row_skip <- 2000000 # initial value
col_names <- c('..', '..')
chunk <- read_csv("G:/../data.csv", skip = row_skip,
                  n_max = chunk_size, col_names = col_names)
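For context, here is a minimal sketch of the full loop; the column names are placeholders and the per-chunk filtering step is omitted:
library(readr)

chunk_size <- 2000000                        # rows per chunk
col_names <- c('..', '..')                   # actual column names omitted
row_skip <- 0
chunks <- list()

repeat {
  chunk <- read_csv("G:/../data.csv", skip = row_skip,
                    n_max = chunk_size, col_names = col_names)
  if (nrow(chunk) == 0) break                # nothing left to read
  chunks[[length(chunks) + 1]] <- chunk      # filtering of the chunk omitted here
  row_skip <- row_skip + chunk_size          # skip grows by one chunk per iteration
}

result <- do.call(rbind, chunks)             # concatenate the processed chunks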
My problem is that if the skip parameter is sufficiently large (~500,000+), I get the following error:
Error: The size of the connection buffer (131072) was not large enough to fit a complete line: Increase it by setting
Sys.setenv("VROOM_CONNECTION_SIZE")
I have already tried to increase it (Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 10000)), but the problem still persists. The skip parameter has to keep increasing as I iterate through the chunks, so regardless of the initial chunk size it eventually reaches the threshold at which the error appears.
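The override is applied like this (shown immediately before the read for illustration; the multiplier is arbitrary):
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 10000)   # larger buffer, but the error remains
chunk <- read_csv("G:/../data.csv", skip = row_skip,
                  n_max = chunk_size, col_names = col_names)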
I thought it could be caused by insufficient RAM, but I can read a much larger initial chunk than the value of skip at which the error appears (e.g. chunk_size = 5000000 works, while row_skip = 2000000 already triggers the error).
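To illustrate the contrast (same file and column names as above, simplified calls):
# Works: a single large read with no skipping
big_chunk <- read_csv("G:/../data.csv", n_max = 5000000, col_names = col_names)

# Fails with the connection buffer error, even though fewer rows are returned
chunk <- read_csv("G:/../data.csv", skip = 2000000,
                  n_max = chunk_size, col_names = col_names)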