Rails ActiveRecord string field encoding vs Ruby String encoding


Context: Transcoding a string from an external source for saving in the database

From a gem, I get a string s with Latin-1 (ISO-8859-1) encoded content that I want to store in a Rails model.

r = MyRecord.new(mystring: s)
# ...
r.save

Because my PostgreSQL database uses UTF-8 encoding, saving the model after setting its string field to the string causes an error when that string contains certain non-ASCII characters:

ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR:  invalid byte sequence for encoding "UTF8": 0xdf 0x65
...

I can solve this easily by transcoding the string:

r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
# ...
r.save

(Because s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I'm passing the source encoding as the second argument. The gem that produced s probably isn't aware that the file it read the string from is Latin-1 encoded.)
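To make the transcoding step concrete, here is a minimal sketch that needs no database. The sample word "Straße" and the variable names utf8 and relabeled are assumptions chosen for illustration; the bytes 0xdf 0x65 from the error message above correspond to "ße" in Latin-1.

```ruby
# Latin-1 bytes for "Straße", tagged as binary the way the gem hands them over.
s = "Stra\xDFe".b
s.encoding # ASCII-8BIT, i.e. binary: Ruby has no idea these bytes are Latin-1

# Transcode while naming the real source encoding explicitly, as in the question.
utf8 = s.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

# Equivalent two-step form: relabel a copy as Latin-1, then transcode it.
relabeled = s.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
```

Both forms should yield the same valid UTF-8 string. Calling s.encode(Encoding::UTF_8) without the source encoding raises Encoding::UndefinedConversionError, because Ruby cannot map arbitrary binary bytes to characters.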

Challenge: Avoid hard-coding the destination encoding

It occurred to me that knowledge about the database's string encoding does not belong in the part of the code where I do the persisting, and thus also the transcoding.

I can ask the model's class for the database's encoding:

MyRecord.connection.encoding

This doesn't return a Ruby Encoding object though, it returns a string containing the encoding's name. Fortunately, the Encoding class can be queried with names (and some aliases) to look up encodings:

Encoding.find 'UTF-8' # returns #<Encoding:UTF-8>, the value of Encoding::UTF_8

Unfortunately, different naming conventions are used: MyRecord.connection.encoding returns 'UTF8' (no hyphen), while Encoding.find(...) needs to be passed 'UTF-8' (with hyphen) or 'CP65001' if we want it to return #<Encoding:UTF-8>.
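The mismatch can be checked in plain Ruby. This is a small sketch; lookup_encoding and results are hypothetical names introduced here, not part of Ruby or Rails.

```ruby
# Returns the Ruby Encoding for a name, or nil when Ruby does not know the name.
def lookup_encoding(name)
  Encoding.find(name)
rescue ArgumentError
  nil
end

# 'UTF-8' and its alias 'CP65001' resolve; the PostgreSQL spelling 'UTF8' does not.
results = %w[UTF-8 CP65001 UTF8].to_h { |name| [name, lookup_encoding(name)] }
```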

Sooooo close.

Question: Is there a clean and/or recommended way

to avoid hard-coding the destination encoding and instead dynamically determine and use the database's encoding?

Discarded ideas

I don't feel that string manipulation or pattern matching on the result of MyRecord.connection.encoding, or on the contents of Encoding.aliases(), would be any better than just leaving the hard-coded values in the code.

Modifying Encoding.aliases()'s return value doesn't have any effect:

Encoding.aliases['UTF8'] = 'UTF-8'
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8

(and doesn't feel right either, anyway), nor does modifying the return value of #names:

Encoding::UTF_8.names.push('UTF8')
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8

I guess both only return dynamically generated collections or copies of the underlying collections, and for a good reason.
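That guess can be checked with object identity. The sketch below assumes MRI behaviour, where each call builds a fresh collection; the variable names are made up for illustration.

```ruby
# Each call returns a different object, so mutations never reach Ruby's internal tables.
same_alias_hash = Encoding.aliases.equal?(Encoding.aliases)
same_names_array = Encoding::UTF_8.names.equal?(Encoding::UTF_8.names)

# Mutating one returned copy leaves the next returned copy untouched.
copy = Encoding.aliases
copy['UTF8'] = 'UTF-8'
alias_visible_later = Encoding.aliases.key?('UTF8')
```

All three values should be false, which is consistent with Encoding.find still rejecting 'UTF8' after the attempted modifications.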


There is 1 answer

Answer by Denis Washington (best answer)

The simplest and, arguably, cleanest solution to this problem would be to not call Encoding.find directly, but to have a utility method (perhaps in a module located at lib/yourapp) that knows about the encoding name differences you care about and falls back to Encoding.find for all other inputs:

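module YourApp
  module DatabaseStringEncoding
    # Maps a database encoding name to the matching Ruby Encoding object.
    def self.find(name)
      case name
      when 'UTF8'
        Encoding::UTF_8
      # ... (further names that differ between the database and Ruby)
      else
        # Names that Ruby already understands need no special handling.
        Encoding.find(name)
      end
    end
  end
end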

This is both easy to understand and discover (as opposed to modifying Encoding directly, which is not visible to the reader of the code which does the encoding). Based on such a find method, you could then go further and implement a method which automatically recodes a string to the database's string encoding using YourRecord.connection.encoding.
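One possible shape for that recoding method is sketched below. The names NAME_OVERRIDES and recode are hypothetical, and only the 'UTF8' mapping comes from the question; the database's encoding name is passed in as a plain string so the sketch runs without ActiveRecord.

```ruby
module YourApp
  module DatabaseStringEncoding
    # Database-style names that Encoding.find rejects. Only 'UTF8' is taken
    # from the question; add further entries as you encounter them.
    NAME_OVERRIDES = { 'UTF8' => Encoding::UTF_8 }.freeze

    # Looks up the override table first, then falls back to Encoding.find.
    def self.find(name)
      NAME_OVERRIDES.fetch(name) { Encoding.find(name) }
    end

    # Transcodes string into the encoding named by the database.
    # source defaults to the string's own label; pass it explicitly when
    # that label is wrong (for example ASCII-8BIT for Latin-1 content).
    def self.recode(string, database_encoding_name, source: string.encoding)
      string.encode(find(database_encoding_name), source)
    end
  end
end

# In a Rails app the call might look like this (assumes the MyRecord model):
#   YourApp::DatabaseStringEncoding.recode(
#     s, MyRecord.connection.encoding, source: Encoding::ISO_8859_1
#   )
```

Keeping the name table in one constant means the calling code never mentions UTF-8, which was the goal of the question.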

I know it would be more exciting to get Encoding.find to do exactly what you want, but I would argue that this "dumber" approach would actually be the better one. :-)