Context: Transcoding a string from an external source for saving in the database
From a gem, I get a string s that has Latin-1-encoded content and that I want to store in a Rails model.
r = MyRecord.new(mystring: s)
# ...
r.save
Because my PostgreSQL database uses UTF-8 encoding, saving the model after assigning that string to its string field raises an error when the string contains certain non-ASCII characters:
ActiveRecord::StatementInvalid: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xdf 0x65
...
I can solve this easily by transcoding the string:
r = MyRecord.new(mystring: s.encode(Encoding::UTF_8, Encoding::ISO_8859_1))
# ...
r.save
(Because s.encoding returns #<Encoding:ASCII-8BIT> instead of #<Encoding:ISO-8859-1>, I'm passing the source encoding as the second argument. The gem that produced s probably isn't aware that the file it read the string from is Latin-1 encoded.)
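The transcoding step can be reproduced without the gem or a database. The following is a minimal sketch; the sample string is an assumption, constructed to contain the bytes from the error message above (0xdf 0x65, which is "ße" in Latin-1).

```ruby
# Hypothetical sample: "Straße" as Latin-1 bytes, tagged as binary (ASCII-8BIT).
# This is assumed to mimic what the gem returns.
s = "Stra\xDFe".b

# Interpret the bytes as ISO-8859-1 and convert them to UTF-8.
utf8 = s.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

puts utf8.encoding         # UTF-8
puts utf8.valid_encoding?  # true
```

Leaving out the second argument makes Ruby try to convert from ASCII-8BIT, which raises Encoding::UndefinedConversionError for the byte 0xDF.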
Challenge: Avoid hard-coding the destination encoding
It occurred to me that knowledge about the database's string encoding does not belong in the part of the code that does the persisting, and thus also the transcoding.
I can ask the model's class for the database's encoding:
MyRecord.connection.encoding
This doesn't return a Ruby Encoding object, though; it returns a string containing the encoding's name. Fortunately, the Encoding class can be queried with names (and some aliases) to look up encodings:
Encoding.find 'UTF-8' # returns #<Encoding:UTF-8>, the value of Encoding::UTF_8
Unfortunately, different naming conventions are used: MyRecord.connection.encoding returns 'UTF8' (no hyphen), while Encoding.find(...) needs to be passed 'UTF-8' (with hyphen) or 'CP65001' if we want it to return #<Encoding:UTF-8>.
Sooooo close.
Question: Is there a clean and/or recommended way
to avoid hard-coding the destination encoding and instead dynamically determine and use the database's encoding for that?
Discarded ideas
I don't feel that doing string manipulation or pattern matching on the result of MyRecord.connection.encoding or on the contents of Encoding.aliases() would be any better than just leaving the hard-coded values in the code.
Modifying Encoding.aliases()'s return value doesn't have any effect:
Encoding.aliases['UTF8'] = 'UTF-8'
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
(and doesn't feel right either, anyway), nor does modifying the return value of #names:
Encoding::UTF_8.names.push('UTF8')
Encoding.find 'UTF8' # ArgumentError: unknown encoding name - UTF8
I guess both only return dynamically generated collections or copies of the underlying collections, and for a good reason.
The simplest and, arguably, cleanest solution to this problem would be to not call Encoding.find directly, but to have a utility method (perhaps in a module located at lib/yourapp) which knows about the encoding name differences you care about and falls back to Encoding.find for all other inputs.
This is both easy to understand and easy to discover (as opposed to modifying Encoding directly, which is not visible to the reader of the code that does the encoding). Based on such a find method, you could then go further and implement a method which automatically recodes a string to the database's string encoding using YourRecord.connection.encoding.
I know it would be more exciting to get Encoding.find to do exactly what you want, but I would argue that this "dumber" approach would actually be the better one. :-)
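The utility method described above could be sketched roughly as follows. The module name EncodingLookup, the DB_NAMES table, and the recode helper are all hypothetical names chosen for illustration, and the table lists only two PostgreSQL-style names as an assumption about which differences matter; extend it with whatever your database actually reports.

```ruby
# Hypothetical helper module, e.g. placed somewhere under lib/yourapp.
module EncodingLookup
  # Assumed mapping from database encoding names to names Ruby understands.
  DB_NAMES = {
    'UTF8'   => 'UTF-8',
    'LATIN1' => 'ISO-8859-1'
  }.freeze

  # Translate known database names, then fall back to Encoding.find
  # for every other input.
  def self.find(name)
    key = name.to_s
    Encoding.find(DB_NAMES.fetch(key.upcase, key))
  end

  # Transcode string to the encoding named by `to`, reading its bytes
  # as `from` (defaults to the encoding the string is tagged with).
  def self.recode(string, to:, from: string.encoding)
    string.encode(find(to), from)
  end
end

puts EncodingLookup.find('UTF8')   # UTF-8
puts EncodingLookup.find('UTF-8')  # UTF-8

# In the Rails app from the question, the call might look like this:
#   EncodingLookup.recode(s, to: MyRecord.connection.encoding,
#                            from: Encoding::ISO_8859_1)
```

Keeping the name translation in one small table means the code that persists the record never mentions a concrete encoding name for the destination.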