I am working on R code that reads the content of a file through an API call. The content is base64 encoded and the file itself is over 2 GB.
I have tried a few approaches so far: downloading the file into memory and writing it to disk. In both cases, decoding the file fails with:
Error in readLines: R character strings are limited to 2^31-1 bytes
Has anyone faced this before and knows how to work around it? At first I tried the call with httr, then switched to httr2:
req <- request(test_datapull_API) |>
req_headers('Content-Type' = 'application/json', 'Cookie' = sprintf('token=%s; username=%s', token, input$user_id)) |>
req_body_json(call_body)
tmp <- tempfile()
req_perform(
req,
path = tmp
)
stringvalue <- readLines(tmp) #this is where it fails with the mentioned error
# decode the base64 string to binary (raw) data
b64_result <- base64_decode(stringvalue)
I am running this in Docker with more than 50 GB of memory available and no per-container limit, so I don't believe this is an R memory issue.
Anyone?
I tried the download and decoding with httr and httr2, and with the base64 and openssl libraries.
I've had to do something similar in the past - here's a slightly adapted version.
Iterating over lines
The important thing is to get the number of lines in the file without opening the entire file. This means you know in advance how long your output list is going to be so you don't face problems with growing a list. The way to do this differs depending on your OS.
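As a portable alternative to shelling out to an OS-specific tool, the line count can be obtained by streaming the file as raw bytes and counting newline characters. This is only a sketch: the helper name count_lines and the 1 MB chunk size are assumptions of mine, not part of the original approach.

```r
# Hypothetical helper: count lines without loading the file into memory.
# Reads fixed-size raw chunks and counts newline bytes (0x0A).
count_lines <- function(path, chunk_bytes = 1024L * 1024L) {
  con <- file(path, "rb")
  on.exit(close(con), add = TRUE)
  n <- 0            # double, so very large counts cannot overflow an integer
  last <- raw(0)    # last byte seen, to detect a missing trailing newline
  repeat {
    buf <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(buf) == 0L) break
    n <- n + sum(buf == as.raw(10L))
    last <- buf[length(buf)]
  }
  # A final line without a trailing newline still counts as a line
  if (length(last) == 1L && last != as.raw(10L)) n <- n + 1
  n
}
```

If this returns 1 for a multi-gigabyte file, the payload is a single unwrapped line and reading it line by line will not help.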
Then the function to read the file in chunks is straightforward:
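A minimal sketch of such a reader, assuming the line count is already known and passed in as n_lines. The function name read_in_chunks and the default chunk size of 100 lines are illustrative choices, not taken from the original code.

```r
# Hypothetical helper: read a text file as a list of character vectors,
# each holding at most `chunk_size` lines. The list is preallocated from
# the known line count, so it never has to grow.
read_in_chunks <- function(path, n_lines, chunk_size = 100L) {
  n_chunks <- ceiling(n_lines / chunk_size)
  chunks <- vector("list", n_chunks)
  con <- file(path, "r")
  on.exit(close(con), add = TRUE)
  for (i in seq_len(n_chunks)) {
    # readLines() on an open connection continues where the last call stopped
    chunks[[i]] <- readLines(con, n = chunk_size)
  }
  chunks
}
```

Because the connection stays open between calls, only one chunk of lines is read per iteration. For a file that does not fit in memory, process each chunk inside the loop instead of storing it.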
To test it we can create a temporary text file with, for example, 903 lines:
Then read it back in:
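The round trip can be sketched as follows. The file contents ("line 1" to "line 903") and the chunk size of 100 are made up for illustration, and the reading loop is written inline so the snippet runs on its own.

```r
# Write a temporary file with 903 lines
tmp <- tempfile(fileext = ".txt")
expected <- sprintf("line %d", 1:903)
writeLines(expected, tmp)

# Read it back 100 lines at a time into a preallocated list
chunk_size <- 100L
n_chunks <- ceiling(length(expected) / chunk_size)
chunks <- vector("list", n_chunks)
con <- file(tmp, "r")
for (i in seq_len(n_chunks)) {
  chunks[[i]] <- readLines(con, n = chunk_size)
}
close(con)

lengths(chunks)  # nine chunks of 100 lines and a final chunk of 3
all_lines <- unlist(chunks)
identical(all_lines, expected)
```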
base64 encoded data
Note this also works if we convert the string to JSON and then base64 encode it. This is because base64 output is commonly wrapped after 76 characters (the MIME convention, and the default of the base64 command-line tool).
However, the line breaks are not meaningful with this approach, so you'll have to look at the structure of your JSON data to work out where to put the chunks back together.
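If the payload has no line breaks at all (which the readLines error in the question may indicate), a line-based reader cannot split it. One possible workaround is to read raw bytes and decode them in groups whose length is a multiple of 4, since every 4 base64 characters decode independently to up to 3 bytes. The sketch below assumes the openssl package and correctly padded input. The name decode_b64_file and the 4 MB chunk size are hypothetical.

```r
library(openssl)  # base64_decode()

# Hypothetical helper: stream-decode a base64 file of any size to `dest`
decode_b64_file <- function(src, dest, chunk_bytes = 4L * 1024L * 1024L) {
  inp <- file(src, "rb")
  on.exit(close(inp), add = TRUE)
  out <- file(dest, "wb")
  on.exit(close(out), add = TRUE)
  carry <- raw(0)  # base64 characters left over from the previous chunk
  repeat {
    buf <- readBin(inp, what = "raw", n = chunk_bytes)
    if (length(buf) == 0L) break
    # Drop LF/CR so wrapped and unwrapped input are handled the same way
    buf <- buf[buf != as.raw(10L) & buf != as.raw(13L)]
    buf <- c(carry, buf)
    usable <- length(buf) - length(buf) %% 4L
    if (usable > 0L) {
      writeBin(base64_decode(rawToChar(buf[seq_len(usable)])), out)
    }
    # Keep the 0 to 3 trailing characters for the next round
    carry <- buf[seq_len(length(buf) - usable) + usable]
  }
  if (length(carry) > 0L) {
    stop("base64 input length is not a multiple of 4 (padding missing?)")
  }
  invisible(dest)
}
```

This never builds a single R string longer than one chunk, so the 2^31-1 byte limit does not come into play. The decoded file can then be read with a streaming parser or in chunks as above.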
Putting base64 encoded json back together
This will depend on the exact format of your data, but in the absence of that I will replicate a base64 encoded version of mtcars a thousand times as an example:
This is basically a JSON array of rows in the following format:
We can write a function to parse these rows. The important thing is to pass the last, incomplete line of each one to the next chunk, where we prepend it.
Then loop over the chunks, passing the final incomplete line to the next chunk:
This will piece back together the JSON chunks.
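The carry-over idea can be sketched end to end as follows. This is not the original code: it assumes the jsonlite and openssl packages, a flat JSON array of row objects (no nested braces and no "}" inside string values), ASCII content, and base64 lines of 76 characters. The helper parse_chunk is a hypothetical name, and mtcars is repeated only 10 times to keep the example fast.

```r
library(jsonlite)  # toJSON(), fromJSON()
library(openssl)   # base64_encode(), base64_decode()

# Example data: mtcars repeated, as base64 wrapped at 76 characters per line
big <- mtcars[rep(seq_len(nrow(mtcars)), 10), ]
rownames(big) <- NULL
b64 <- base64_encode(charToRaw(as.character(toJSON(big))))
starts <- seq(1, nchar(b64), by = 76)
tmp <- tempfile()
writeLines(substring(b64, starts, starts + 75), tmp)

# Decode one chunk of lines, prepend the previous incomplete tail, and
# split the text into the complete rows and the new incomplete tail
parse_chunk <- function(b64_lines, carry = "") {
  decoded <- rawToChar(base64_decode(paste(b64_lines, collapse = "")))
  txt <- paste0(carry, decoded)
  closes <- gregexpr("}", txt, fixed = TRUE)[[1]]
  if (closes[1] == -1L) return(list(rows = NULL, carry = txt))
  last <- closes[length(closes)]
  complete <- substr(txt, 1, last)
  # Strip the array opener or the separator left at the start of the text
  if (substr(complete, 1, 1) %in% c("[", ",")) {
    complete <- substr(complete, 2, nchar(complete))
  }
  list(rows = fromJSON(paste0("[", complete, "]")),
       carry = substr(txt, last + 1L, nchar(txt)))
}

# Loop over the chunks, handing the incomplete tail to the next one
chunk_size <- 50L
pieces <- vector("list", ceiling(length(starts) / chunk_size))
carry <- ""
con <- file(tmp, "r")
for (i in seq_along(pieces)) {
  res <- parse_chunk(readLines(con, n = chunk_size), carry)
  carry <- res$carry
  if (!is.null(res$rows)) pieces[[i]] <- res$rows
}
close(con)
result <- do.call(rbind, pieces)
```

Decoding each chunk independently works here because 76 is a multiple of 4, so every full line is a whole number of base64 groups. After the loop, carry should hold only the closing "]" of the array.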