I am trying to do the following:
- download the output of an Athena query from S3 (file.csv)
- gzip the output and upload to a different S3 location (file.csv.gz)
- use S3 Select from within the Ruby SDK to query the contents of file.csv.gz

I always get the following error, always "near byte 8192", even when the contents of file.csv.gz are completely different:

Aws::S3::Errors::InvalidTextEncoding (UTF-8 encoding is required. The text encoding error was found near byte 8,192.)

NB: using the same S3 Select query against the same, uncompressed file.csv works as expected. I have tried all kinds of weird things but am full of despair.
Steps to reproduce:
- Start with file s3://mybucket/file.csv
- Download with aws-cli: aws s3 cp s3://mybucket/file.csv file.csv
- Gzip the file: gzip file.csv
- Upload the file.csv.gz: aws s3 cp file.csv.gz s3://mybucket/file.csv.gz
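If it helps with reproducing, the same steps can also be done from Ruby directly; here is a minimal sketch using aws-sdk-s3 and the standard-library Zlib (bucket name and local paths are taken from the steps above; credentials are assumed to come from the default AWS credential chain):

require 'aws-sdk-s3'
require 'zlib'

s3 = Aws::S3::Client.new

# Download file.csv from S3 to a local file.
s3.get_object(bucket: 'mybucket', key: 'file.csv', response_target: 'file.csv')

# Gzip it locally (equivalent to `gzip file.csv`).
# Note: this reads the whole file into memory, which is fine for a small CSV.
Zlib::GzipWriter.open('file.csv.gz') do |gz|
  gz.write(File.binread('file.csv'))
end

# Upload the gzipped copy.
File.open('file.csv.gz', 'rb') do |file|
  s3.put_object(bucket: 'mybucket', key: 'file.csv.gz', body: file)
end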
Here's the code:
require 'aws-sdk-s3'

class RunsS3SelectQueries
  def self.client
    @client ||= Aws::S3::Client.new
  end

  def self.run_query(sql:, bucket:, key:)
    data = ""
    handler = Aws::S3::EventStreams::SelectObjectContentEventStream.new
    handler.on_records_event do |event|
      puts "----records payload:----"
      payload = event.payload.read
      data += payload
    end
    handler.on_stats_event do |event|
      # the :stats event contains progress information
      puts event.details.inspect
      # => Aws::S3::Types::Stats bytes_scanned=xx, bytes_processed=xx, bytes_returned=xx
    end

    params = {
      bucket: bucket,
      key: key,
      expression_type: "SQL",
      expression: sql,
      input_serialization: {
        csv: { file_header_info: "USE" }
      },
      output_serialization: {
        csv: {}
      },
      event_stream_handler: handler,
    }
    client.select_object_content(params)
    data
  end
end
The following receives the text encoding error.
output = RunsS3SelectQueries.run_query(sql: %q{SELECT * FROM S3Object }, bucket: 'mybucket', key: 'file.csv.gz')
However, running against the uncompressed file.csv does not:
output = RunsS3SelectQueries.run_query(sql: %q{SELECT * FROM S3Object }, bucket: 'mybucket', key: 'file.csv')
I've tried all kinds of combinations of text encodings, content-type metadata, content-encoding, etc., and can't seem to find anything that works. The fact that the error always occurs at byte 8192 is pretty weird/suspicious, in my opinion.
Any help would be much appreciated!
You need to specify that the input is gzipped in input_serialization, otherwise S3 will try to parse the raw gzip bytes as text and fail with an error about them not being valid UTF-8 at byte 8192. Something like the following will work:

input_serialization: { csv: { file_header_info: "USE" }, compression_type: "GZIP" }

Note that in the Ruby SDK the option is snake_case (compression_type) and sits alongside csv, not inside it.
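For completeness, here is the params hash from your question with that one change applied (a sketch; everything else is unchanged):

params = {
  bucket: bucket,
  key: key,
  expression_type: "SQL",
  expression: sql,
  input_serialization: {
    csv: { file_header_info: "USE" },
    compression_type: "GZIP"  # tell S3 Select to decompress before parsing the CSV
  },
  output_serialization: {
    csv: {}
  },
  event_stream_handler: handler,
}
client.select_object_content(params)

With that in place, the query against file.csv.gz should return the same rows as the query against the uncompressed file.csv.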