Encoding Error Using AWS S3 Select with the AWS SDK for Ruby


I am trying to do the following:

  • download the output of an Athena query from S3 (file.csv)
  • gzip the output and upload to a different S3 location (file.csv.gz)
  • use S3 Select from within the Ruby SDK to query the contents of file.csv.gz

I always get the following error, and it is always "near byte 8192", even when the contents of file.csv.gz are completely different:

Aws::S3::Errors::InvalidTextEncoding (UTF-8 encoding is required. The text encoding error was found near byte 8,192.)

NB: using the same S3 Select query against the same, uncompressed file.csv works as expected. I have tried all kinds of weird things but am full of despair.

Steps to reproduce:

  1. Start with file s3://mybucket/file.csv
  2. Download with aws-cli: aws s3 cp s3://mybucket/file.csv file.csv
  3. Gzip the file: gzip file.csv
  4. Upload the file.csv.gz: aws s3 cp file.csv.gz s3://mybucket/file.csv.gz
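
For completeness, the same three steps can be scripted with the Ruby SDK instead of the aws-cli. This is just a minimal sketch; mybucket and the file names are the placeholders from the steps above:

require "aws-sdk-s3"
require "zlib"

s3 = Aws::S3::Client.new

# 2. download s3://mybucket/file.csv to a local file
s3.get_object(bucket: "mybucket", key: "file.csv", response_target: "file.csv")

# 3. gzip the local file
Zlib::GzipWriter.open("file.csv.gz") { |gz| gz.write(File.binread("file.csv")) }

# 4. upload the gzipped copy as file.csv.gz
File.open("file.csv.gz", "rb") do |file|
  s3.put_object(bucket: "mybucket", key: "file.csv.gz", body: file)
end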

Here's the code:

class RunsS3SelectQueries
  def self.client
    @client ||= Aws::S3::Client.new
  end

  def self.run_query(sql:, bucket:, key:)
    data = ""
    handler = Aws::S3::EventStreams::SelectObjectContentEventStream.new
    handler.on_records_event do |event|
      puts "----records payload:----"
      payload = event.payload.read
      data += payload
    end
    handler.on_stats_event do |event|
      # the :stats event contains progress information
      puts event.details.inspect
      # => Aws::S3::Types::Stats bytes_scanned=xx, bytes_processed=xx, bytes_returned=xx
    end
    params = {
      bucket: bucket,
      key: key,
      expression_type: "SQL",
      expression: sql,
      input_serialization: {
        csv: { file_header_info: "USE"}
      },
      output_serialization: {
        csv: {}
      },
      event_stream_handler: handler,
    }
    client.select_object_content(params)
    data
  end
end

The following call triggers the text encoding error:

output = RunsS3SelectQueries.run_query(sql: %q{SELECT * FROM S3Object }, bucket: 'mybucket', key: 'file.csv.gz')

However, running against the uncompressed file.csv does not:

output = RunsS3SelectQueries.run_query(sql: %q{SELECT * FROM S3Object }, bucket: 'mybucket', key: 'file.csv')

I've tried all kinds of combinations of text encodings, content-type metadata, content-encoding, etc., and can't find anything that works. The fact that the error always occurs at byte 8192 seems suspicious to me.

Any help would be much appreciated!


1 Answer

Answered by gnicholas:

You need to specify that the input is gzipped in input_serialization; otherwise S3 Select tries to read the raw gzip bytes as text and reports that they are not valid UTF-8 at byte 8192 (presumably the size of its internal read buffer, which would explain why the offset never changes).

Something like the following will work:

input_serialization: { csv: { file_header_info: "USE" }, compression_type: "GZIP" }
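
Note that the Ruby SDK expects snake_case option names, so the API's CompressionType field becomes compression_type. Applied to the params hash from the question, the fix would look like this (a sketch only; bucket, key, sql, and handler are the same variables as in the question's code):

params = {
  bucket: bucket,
  key: key,
  expression_type: "SQL",
  expression: sql,
  input_serialization: {
    csv: { file_header_info: "USE" },
    compression_type: "GZIP"   # tell S3 Select to gunzip before scanning
  },
  output_serialization: {
    csv: {}
  },
  event_stream_handler: handler,
}
client.select_object_content(params)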