Ruby (Rails) email (base64) gets split at diacritics and content lost in mysql


I have a problem with my app, which reads e-mails from an external server using the mailman gem (which in turn uses the mail gem).

ruby 1.9.2p0
mail (2.3.0)
mailman (0.4.0) 
actionmailer (= 3.1.3)

database.yml

production:
  adapter: mysql2
  encoding: utf8

Here is a simple method that receives a mail. I build @message_body from the text_part of a multipart e-mail (e.g. one with attachments) or from the whole decoded body.

def self.receive_mail(message)
    # some code here
    @message_body = message.multipart? ? message.text_part.body.to_s : message.body.decoded
    # some code here, to save message in database
end

My problem is that if the message has no attachments but contains diacritics, like ą ś ł ń ż ź ó, the body is cut off just before the first diacritic. So if the body is "test żłóbek test", I get only "test " in @message_body.

My question is how to save such a message in an elegant way, so that the text part is stored in the database with all its diacritics.

EDIT: to make it clearer, the e-mails I get look like this one (it's just a part of an e-mail sent from Gmail):

--20cf307ac4372d830104c11c8cc6
Date: Mon, 28 May 2012 20:06:16 +0200
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: base64
Content-ID: <[email protected]_domain>

dGVzdCC/s7zm8bbzsSB0ZXN0Cg==

So we have this body: dGVzdCC/s7zm8bbzsSB0ZXN0Cg==

After decoding we get: 'test \xbf\xb3\xbc\xe6\xf1\xb6\xf3\xb1 test\n'

And the problem is that everything from '\xbf' onward is not saved in the database.

UPDATE

Another example; I think this is where the problem is:

irb(main):008:0* require 'base64'
=> true
irb(main):009:0> a = "test źćłżąńś"
=> "test źćłżąńś"
irb(main):010:0> b = Base64.encode64(a)
=> "dGVzdCDFusSHxYLFvMSFxYTFmw==\n"
irb(main):011:0> Base64.decode64(b)
=> "test \xC5\xBA\xC4\x87\xC5\x82\xC5\xBC\xC4\x85\xC5\x84\xC5\x9B"

See, after decode64 my diacritics are LOST. What should I do to get them back?
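A minimal sketch of what the irb session above is doing, assuming a UTF-8 source file. It uses pack("m") and unpack("m"), the primitives that Base64.encode64 and Base64.decode64 wrap, so it needs no require. The variable names are illustrative. The decoded bytes are identical to the original ones; only the encoding tag differs (binary instead of UTF-8), which is why irb prints \x escapes.

```ruby
original = "test źćłżąńś"             # UTF-8 literal, as in the irb session

encoded = [original].pack("m")        # what Base64.encode64(original) does
decoded = encoded.unpack("m").first   # what Base64.decode64(encoded) does

puts decoded.encoding                 # binary tag, hence the \xC5\xBA... escapes
same_bytes = decoded.bytes == original.bytes

# The bytes are already UTF-8, so re-tagging is enough here; no transcoding needed.
restored = decoded.dup.force_encoding("UTF-8")
puts restored
```

This only works because the string was UTF-8 before encoding. For a body in another charset, such as the ISO-8859-2 e-mail in the question, the bytes have to be transcoded as well.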


There are 2 answers

Frederick Cheung On
force_encoding('utf-8') doesn't work because the data isn't UTF-8: your mail headers clearly state that the message body is ISO 8859-2.

Mysql2 assumes everything is UTF-8, but the bytes can't be converted to UTF-8 (because Ruby doesn't know the original encoding), so your non-ASCII characters are thrown away by MySQL.

For that one string you could try

body.force_encoding('ISO-8859-2').encode('utf-8')

But really you want to work out which encoding to use from the Content-Type header. I'm surprised the mail gem isn't doing that for you.
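For the sample message in the question, the one-off conversion could be sketched like this. The payload is copied from the question; unpack("m") stands in for the Base64 decoding the mail gem performs, and the variable names are illustrative.

```ruby
# Base64 payload copied from the sample e-mail in the question.
payload = "dGVzdCC/s7zm8bbzsSB0ZXN0Cg=="

# Decode it; the result is tagged as binary (ASCII-8BIT).
raw = payload.unpack("m").first

# Tagging the bytes as UTF-8 does not help: 0xBF and friends are not valid UTF-8 here.
utf8_valid = raw.dup.force_encoding("UTF-8").valid_encoding?

# Tag the bytes with the charset from the Content-Type header, then transcode.
body = raw.dup.force_encoding("ISO-8859-2").encode("UTF-8")
puts body
```

force_encoding only changes the label on the bytes, while encode rewrites the bytes themselves, so the order of the two calls matters.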

januszm On

I have the solution. Chaining the

.force_encoding("ORIGINAL_CHARSET").encode("UTF-8")

methods on the e-mail body object is the solution.

I had to change my receive_mail() definition from the previous one-liner to:

if message.multipart?
    # multipart: take the charset declared on the text part
    charset = message.text_part.content_type_parameters[:charset]
    @message_body = message.text_part.body.to_s.force_encoding(charset).encode("UTF-8")
else
    # single part: take the charset declared on the message itself
    charset = message.content_type_parameters[:charset]
    @message_body = message.body.decoded.force_encoding(charset).encode("UTF-8")
end

With this construct I can detect the charset of the original e-mail, tag the decoded body with it, and then transcode it to UTF-8. This ensures that the Base64-decoded body is converted correctly from the original charset to UTF-8.
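One possible hardening, as a sketch only: the code above raises if the charset parameter is missing or names an encoding Ruby does not know. The helper below is hypothetical (to_utf8 is not part of the mail gem), and the ISO-8859-1 fallback is an assumption chosen purely for illustration.

```ruby
# Hypothetical helper: convert raw body bytes to UTF-8 using the declared charset.
# Falls back to `fallback` when the charset is missing or unknown to Ruby.
def to_utf8(raw, charset, fallback = "ISO-8859-1")
  name = charset.to_s.strip
  name = fallback if name.empty?

  encoding = begin
    Encoding.find(name)
  rescue ArgumentError
    Encoding.find(fallback)   # unknown charset name in the header
  end

  # dup so the caller's string keeps its original encoding tag;
  # bytes that cannot be converted are replaced instead of raising.
  raw.dup.force_encoding(encoding).encode("UTF-8", invalid: :replace, undef: :replace)
end

sample = "test \xBF\xB3".dup.force_encoding("ASCII-8BIT")
puts to_utf8(sample, "ISO-8859-2")
```

In the snippet above, the two branches could then call to_utf8(message.text_part.body.to_s, charset) and to_utf8(message.body.decoded, charset) respectively.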

If anyone has a more elegant solution, please share.