Ruby CSV BOM|UTF-8 encoding for StringIO

5.5k views Asked by At

Ruby 2.6.3.

I have been trying to parse a StringIO object into a CSV instance with the bom|utf-8 encoding, so that the BOM character (undesired) is stripped and the content is encoded to UTF-8:

require 'csv'

CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze

content = StringIO.new("\xEF\xBB\xBFid\n123")
first_row = CSV.parse(content, CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF")     # This returns true

Apparently the bom|utf-8 encoding does not work for StringIO objects, but I found that it does work for files, for instance:

require 'csv'

CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze

# File content is: "\xEF\xBB\xBFid\n12"
first_row = CSV.read('bom_content.csv', CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF")     # This returns false

Considering that I need to work with StringIO directly, why does CSV ignores the bom|utf-8 encoding? Is there any way to remove the BOM character from the StringIO instance?

Thank you!

3

There are 3 answers

0
3limin4t0r On

Ruby 2.7 added the set_encoding_by_bom method to IO. This methods consumes the byte order mark and sets the encoding.

require 'csv'
require 'stringio'

CSV_READ_OPTIONS = { headers: true }.freeze

content = StringIO.new("\xEF\xBB\xBFid\n123")
content.set_encoding_by_bom

first_row = CSV.parse(content, CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF")
#=> false
0
Max On

Ruby doesn't like BOMs. It only handles them when reading a file, never anywhere else, and even then it only reads them so that it can get rid of them. If you want a BOM for your string, or a BOM when writing a file, you have to handle it manually.

There are probably gems for doing this, though it's easy to do yourself

if string[0...3] == "\xef\xbb\xbf"
  string = string[3..-1].force_encoding('UTF-8')
elsif string[0...2] == "\xff\xfe"
  string = string[2..-1].force_encoding('UTF-16LE')
# etc
0
Pak On

I found out that forcing encoding to utf8 on the StringIO string and removing the BOM to generate a new StringIO worked:

require 'csv'
CSV_READ_OPTIONS = { headers: true}.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
csv_file = StringIO.new(content.string.force_encoding('utf-8').sub("\xEF\xBB\xBF", ''))
first_row = CSV.parse(csv_file, CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF") # => false  

The encoding option is no more needed. It may not be the best option memory-wise, but it works.