Find and replace in an aws.s3 object during JSON streaming


I have a fairly practical question for which it's hard to provide a regex (sorry for that), so I'll try to explain it properly.

A script connects to an AWS S3 bucket with the aws.s3 package. The bucket contains .gz files which hold JSON, one object per line. Unfortunately some lines (not all) are malformed JSON: they end with }]]} instead of }]}.

So I'm looking for a way in R to find and replace that pattern before parsing the JSON fails. The commented-out line (# gsub()) is my non-working guess at a fix.

What would be your solution?

    data_i <- aws.s3::get_object(
      object = objectname_i,
      bucket = bucketname_i,
      region = "eu-central-1",
      as = "raw"
    ) |>
      rawConnection() |>
      gzcon() |>
      # Non-working guess: gsub() operates on character vectors, not on connections
      # gsub("}]]}", "}]}") |>
      jsonlite::stream_in()

Answer by Andi (best answer)

I found the following solution: after setting up the connection, I use gzcon() for unpacking, as before. Then I read the lines over the connection with readLines(), so the data is in R as a character vector.

Now I can operate on that R object with gsub().

After that I want to use stream_in() again, so I open a textConnection() on the corrected lines. The result is the data.frame s3ObjectDataframe.

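    # Download the gzipped object as a raw vector and wrap it in a decompressing connection
    s3ObjectUnpacked <- aws.s3::get_object(
      object = objectname_i,
      bucket = bucketname_i,
      region = "eu-central-1",
      as = "raw"
    ) |>
      rawConnection() |>
      gzcon()

    # Read the decompressed JSON lines into a character vector
    s3ObjectPerLine <- readLines(s3ObjectUnpacked)

    # fixed = TRUE treats the pattern as a literal string instead of a regular expression
    s3ObjectCorrected <- gsub("}]]}", "}]}", s3ObjectPerLine, fixed = TRUE)

    # readLines() already strips the newlines, so the lines can be streamed in directly
    s3ObjectDataframe <- jsonlite::stream_in(textConnection(s3ObjectCorrected))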
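
For anyone who wants to try this without S3 access, here is a minimal sketch of the same pipeline on local data. The sample lines, the temporary gzip file (standing in for the raw vector returned by `aws.s3::get_object()`), and the helper `fix_line_endings()` are my own assumptions for illustration, not part of the original setup. The helper only touches lines that end with `}]]}`, which may be safer than a global `gsub()` if the same character sequence can legitimately occur in the middle of a line.

```r
library(jsonlite)

# Hypothetical sample data: the second line has the malformed "}]]}" ending
sample_lines <- c(
  '{"id":1,"items":[{"a":1}]}',
  '{"id":2,"items":[{"a":2}]]}'
)

# Stand-in for aws.s3::get_object(..., as = "raw"):
# write a gzip file locally and read its bytes back as a raw vector
tmp <- tempfile(fileext = ".gz")
gz_out <- gzfile(tmp, "w")
writeLines(sample_lines, gz_out)
close(gz_out)
raw_gz <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# Hypothetical helper: repair only lines that END with the malformed suffix
fix_line_endings <- function(lines, bad = "}]]}", good = "}]}") {
  is_bad <- endsWith(lines, bad)
  lines[is_bad] <- paste0(
    substr(lines[is_bad], 1, nchar(lines[is_bad]) - nchar(bad)),
    good
  )
  lines
}

# Same steps as in the answer: decompress, read lines, fix, stream in
gz_in <- gzcon(rawConnection(raw_gz))
lines <- readLines(gz_in)
close(gz_in)

fixed <- fix_line_endings(lines)
df <- stream_in(textConnection(fixed), verbose = FALSE)
```

If valid lines could also end in `}]]}` (for example with nested arrays), one option would be to check each line with `jsonlite::validate()` first and only repair the ones that fail.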