Ruby `CSV.read` error invalid byte sequence in UTF-8 (ArgumentError)

First of all, this is not a duplicate of this SO question. I have a CSV file encoded in Shift-JIS, and this is my script to parse the file:

require 'csv'
str1 = '社員番号'
str2 = 'メールアドレス'
str1.force_encoding("Shift_JIS").encode!
str2.force_encoding("Shift_JIS").encode!
file = File.open("SyainInfo.csv", "r:Shift_JIS")
csv = CSV.read(file, headers: true)
p csv[str1]
p csv[str2]

But even after specifying the encoding, I am getting "invalid byte sequence in UTF-8 (ArgumentError)". Any thoughts? My Ruby version is 2.3.0.

Answer by Stefan (best answer)

First of all, your encoding doesn't look right:

'社員番号'.force_encoding("Shift_JIS").encode!
#=> "\x{E7A4}\xBE\x{E593}\xA1\x{E795}\xAA\x{E58F}\xB7"

force_encoding takes the bytes from str1 and interprets them as Shift JIS, whereas you probably want to convert the string to Shift JIS:

'社員番号'.encode('Shift_JIS')
#=> "\x{8ED0}\x{88F5}\x{94D4}\x{8D86}"
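
To make the difference concrete, here is a minimal sketch (the variable names are only for illustration) showing that force_encoding leaves the bytes alone while encode actually rewrites them:

```ruby
utf8 = '社員番号'

# force_encoding only changes the encoding label; the bytes stay the same.
relabeled = utf8.dup.force_encoding('Shift_JIS')
puts relabeled.bytes == utf8.bytes

# encode transcodes, producing the real Shift JIS byte sequence
# (2 bytes per character here, versus 3 bytes per character in UTF-8).
converted = utf8.encode('Shift_JIS')
puts utf8.bytesize
puts converted.bytesize

# Transcoding back gives the original text again.
puts converted.encode('UTF-8') == utf8
```

So the relabeled string still holds UTF-8 bytes that merely claim to be Shift JIS, which is why it can never match a header read from a real Shift JIS file.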

Next, you can pass a filename to CSV.read, so instead of:

file = File.open(filename)
CSV.read(file)

You can just write:

CSV.read(filename)

That said, you could either work with Shift JIS encoded strings:

require 'csv'
str1 = '社員番号'.encode("Shift_JIS")
str2 = 'メールアドレス'.encode("Shift_JIS")
csv = CSV.read('SyainInfo.csv', encoding: 'Shift_JIS', headers: true)
csv[str1]
csv[str2]
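
The keys have to be encoded here because a UTF-8 string and a Shift JIS string holding the same text are different byte sequences, so they do not compare as equal. A small sketch of that mismatch, using only the header name from the question:

```ruby
utf8_key = '社員番号'
sjis_key = utf8_key.encode('Shift_JIS')

# Same text, different encodings and bytes: the comparison fails.
puts utf8_key == sjis_key

# Once both sides are in the same encoding, they match again.
puts utf8_key == sjis_key.encode('UTF-8')
```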

Or – and that's what I would do – you could work with UTF-8 strings by specifying a second encoding:

require 'csv'
str1 = '社員番号'
str2 = 'メールアドレス'
csv = CSV.read('SyainInfo.csv', encoding: 'Shift_JIS:UTF-8', headers: true)
csv[str1]
csv[str2]

encoding: 'Shift_JIS:UTF-8' instructs CSV to read Shift JIS data and transcode it to UTF-8. It's equivalent to passing 'r:Shift_JIS:UTF-8' to File.open.
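
If you want to check that equivalence yourself, the following sketch writes a small Shift JIS file to a temporary location and reads it both ways. The sample row and the temp file name are made up for the illustration; only the header names come from the question.

```ruby
require 'csv'
require 'tempfile'

# Write a tiny Shift JIS encoded CSV file (the data row is invented).
tmp = Tempfile.new(['syain', '.csv'])
tmp.binmode
tmp.write("社員番号,メールアドレス\n1001,taro@example.com\n".encode('Shift_JIS'))
tmp.close

# Option 1: let CSV open the file and transcode Shift JIS to UTF-8.
via_csv = CSV.read(tmp.path, encoding: 'Shift_JIS:UTF-8', headers: true)

# Option 2: open the file yourself with the same encodings and parse the IO.
via_file = File.open(tmp.path, 'r:Shift_JIS:UTF-8') do |f|
  CSV.new(f, headers: true).read
end

# Both tables can be queried with plain UTF-8 keys.
p via_csv['社員番号']
p via_file['メールアドレス']

tmp.unlink
```

Note that in the second variant the open IO object goes to CSV.new, not to CSV.read, which expects a path.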