Decoding Amazon Reports in CP932 with Ruby

Reports out of Amazon's SP-API are generally in UTF-8 except for the ones out of Japan, which are in CP932. I cannot seem to figure out how to decode these into usable data.

I am running Ruby 3.1.2 and using the amz_sp_api gem to connect to Amazon.

For CSV reports we are doing:

data = AmzSpApi.inflate_document(content, report_document)
csv_string = CSV.generate do |csv|
  data.gsub("\r", "").split("\n").each do |line|
    csv << line.split("\t")
  end
end
csv_string.force_encoding 'ASCII-8BIT'
csv = CSV.parse(csv_string, headers: true)

This doesn't raise any errors, but the resulting data looks something like:

...
"ship-state"=>"\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C",

If I force the encoding to 'CP932' instead, parsing the CSV fails with:

3.1.2/lib/ruby/3.1.0/csv/parser.rb:786:in `build_scanner': Invalid byte sequence in Windows-31J in line 2. (CSV::MalformedCSVError)

For the XML reports we are using Nokogiri and doing something like this:

data = AmzSpApi.inflate_document(content, report_document)
parsed_xml = Nokogiri::XML(data)

The resulting XML contains only part of the first node; parsing seems to fail silently.

In the above example, data reports this encoding:

data.encoding
=> #<Encoding:ASCII-8BIT>

You get the idea.

I obviously need to do SOMETHING to get all this to parse out properly but I am unclear what that something is.

I suspect the data is being converted from a byte string to a regular string somewhere, but that must be happening automatically behind the scenes.

There is 1 answer

Answered by phil

What doesn't work for the Japanese reports (the same code works for Amazon reports from other regions, which come down as UTF-8):

report_result, _status_code, headers = importer.api.get_report_with_http_info(report_id)
report_document_id = report_result.report_document_id
report_document = importer.api.get_report_document(report_document_id)
url = report_document.url
content = Faraday.get(url).body
p "Content is #{content.encoding}"
data = AmzSpApi.inflate_document(content, report_document)
p "Data is #{data.encoding}"
xml = Nokogiri::XML(data)
p "We found #{xml.xpath("//Order").count} orders"

Output:

"Content is ASCII-8BIT"
"Data is ASCII-8BIT"
"We found 1 orders"

In the above, the parsed XML is malformed and unusable (hence the single order).

What works:

report_result, _status_code, headers = importer.api.get_report_with_http_info(report_id)
report_document_id = report_result.report_document_id
report_document = importer.api.get_report_document(report_document_id)
url = report_document.url
content = Faraday.get(url).body
p "Content is #{content.encoding}"
data = AmzSpApi.inflate_document(content, report_document).gsub("CP932", "UTF-8")
p "Data is #{data.encoding}"
xml = Nokogiri::XML(data)
p "We found #{xml.xpath("//Order").count} orders"

Output:

=> "Content is ASCII-8BIT"
=> "Data is ASCII-8BIT"
=> "We found 151 orders"

The issue seems to be that Nokogiri (and the online parsers I tried) cannot handle an XML declaration that names CP932 as the encoding:

<?xml version="1.0" encoding="CP932"?>
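
One way to test that reading is to ask Ruby which encodings the raw bytes are actually valid in. The sketch below is only illustrative: plausible_encodings is a hypothetical helper, and the sample is the ship-state value from the CSV output in the question, not bytes taken from an XML report.

```ruby
# Hypothetical helper: list the candidate encodings in which the raw bytes
# form a structurally valid string. This checks validity only; it does not
# prove which encoding the sender intended.
def plausible_encodings(bytes, candidates = [Encoding::UTF_8, Encoding::Windows_31J])
  candidates.select { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end

# The ship-state bytes shown in the question, as a binary (ASCII-8BIT) string.
sample = "\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C".b

p plausible_encodings(sample).map(&:name)
puts sample.dup.force_encoding(Encoding::UTF_8)
```

For those sample bytes only UTF-8 passes, and relabelling them as UTF-8 gives 福岡県. That suggests, without proving it for every report, that the payload is UTF-8 and only the CP932 label is wrong.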

The gsub version also works for UTF-8 files, because there is no "CP932" string in them to replace.

NOTE: If you use HTTParty instead of Faraday, the content encoding is UTF-8 instead of ASCII-8BIT, but the issue (and solution) remains the same.
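
For the tab-delimited reports in the question, the ship-state bytes shown there are the UTF-8 encoding of 福岡県, so relabelling the string rather than converting it may be all that is needed. A minimal sketch, assuming the inflated report really is UTF-8 and tab-separated; parse_tsv_report is a hypothetical helper, and the sample data is made up apart from the ship-state bytes.

```ruby
require "csv"

# Hypothetical helper: label the inflated bytes as UTF-8, check that the
# label is plausible, then let CSV split on tabs directly.
def parse_tsv_report(data)
  text = data.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "report is not valid UTF-8" unless text.valid_encoding?

  CSV.parse(text, col_sep: "\t", headers: true)
end

# Made-up two-line report; only the ship-state bytes come from the question.
sample = "order-id\tship-state\r\n123\t\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C\r\n".b
table = parse_tsv_report(sample)
puts table[0]["ship-state"]
```

force_encoding only changes the label on the bytes; it does not transcode them, which is why the validity check comes before parsing.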