I am trying to do the following:
- download the output of an Athena query from S3 (file.csv)
- gzip the output and upload to a different S3 location (file.csv.gz)
- use S3 Select from within the Ruby SDK to query the contents of file.csv.gz

I always get the following error, always "near byte 8192", even when the contents of file.csv.gz are completely different:

Aws::S3::Errors::InvalidTextEncoding (UTF-8 encoding is required. The text encoding error was found near byte 8,192.)

NB: using the same S3 Select query against the same, uncompressed file.csv works as expected. I have tried all kinds of weird things but am full of despair.
Steps to reproduce:
- Start with file s3://mybucket/file.csv
- Download with aws-cli: aws s3 cp s3://mybucket/file.csv file.csv
- Gzip the file: gzip file.csv
- Upload the file.csv.gz: aws s3 cp file.csv.gz s3://mybucket/file.csv.gz
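If it helps with reproducing, the same steps can also be done from Ruby directly; here is a minimal sketch using aws-sdk-s3 and the standard-library Zlib (bucket name and local paths are taken from the steps above; credentials are assumed to come from the default AWS credential chain):

require 'aws-sdk-s3'
require 'zlib'

s3 = Aws::S3::Client.new

# Download file.csv from S3 to a local file.
s3.get_object(bucket: 'mybucket', key: 'file.csv', response_target: 'file.csv')

# Gzip it locally (equivalent to `gzip file.csv`).
# Note: this reads the whole file into memory, which is fine for a small CSV.
Zlib::GzipWriter.open('file.csv.gz') do |gz|
  gz.write(File.binread('file.csv'))
end

# Upload the gzipped copy.
File.open('file.csv.gz', 'rb') do |file|
  s3.put_object(bucket: 'mybucket', key: 'file.csv.gz', body: file)
end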
Here's the code:
require 'aws-sdk-s3'

class RunsS3SelectQueries
  def self.client
    @client ||= Aws::S3::Client.new
  end

  def self.run_query(sql:, bucket:, key:)
    data = ""
    handler = Aws::S3::EventStreams::SelectObjectContentEventStream.new
    handler.on_records_event do |event|
      puts "----records payload:----"
      payload = event.payload.read
      data += payload
    end
    handler.on_stats_event do |event|
      # the :stats event contains progress information
      puts event.details.inspect
      # => Aws::S3::Types::Stats bytes_scanned=xx, bytes_processed=xx, bytes_returned=xx
    end

    params = {
      bucket: bucket,
      key: key,
      expression_type: "SQL",
      expression: sql,
      input_serialization: {
        csv: { file_header_info: "USE" }
      },
      output_serialization: {
        csv: {}
      },
      event_stream_handler: handler,
    }
    client.select_object_content(params)
    data
  end
end
The following receives the text encoding error.
output = RunsS3SelectQueries.run_query(sql: %q{SELECT * FROM S3Object }, bucket: 'mybucket', key: 'file.csv.gz')
However, running against the uncompressed file.csv does not:
output = RunsS3SelectQueries.run_query(sql: %q{SELECT * FROM S3Object }, bucket: 'mybucket', key: 'file.csv')
I've tried all kinds of combinations of text encodings, content-type metadata, content-encoding, etc., and can't seem to find anything that works. The fact that the error always occurs at byte 8192 is pretty weird/suspicious, in my opinion.
Any help would be much appreciated!
You need to specify that the input is gzipped in input_serialization, otherwise S3 will try to parse the raw gzip bytes as text and fail with an error about them not being valid UTF-8 at byte 8192. Something like the following will work:

input_serialization: { csv: { file_header_info: "USE" }, compression_type: "GZIP" }

Note that in the Ruby SDK the option is snake_case (compression_type) and sits alongside csv, not inside it.
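For completeness, here is the params hash from your question with that one change applied (a sketch; everything else is unchanged):

params = {
  bucket: bucket,
  key: key,
  expression_type: "SQL",
  expression: sql,
  input_serialization: {
    csv: { file_header_info: "USE" },
    compression_type: "GZIP"  # tell S3 Select to decompress before parsing the CSV
  },
  output_serialization: {
    csv: {}
  },
  event_stream_handler: handler,
}
client.select_object_content(params)

With that in place, the query against file.csv.gz should return the same rows as the query against the uncompressed file.csv.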