I'm trying to parse incoming e-mails and want to store the body as a UTF-8 encoded string in a database. However, I quickly noticed that not all e-mails send charset information in the Content-Type header. After trying some manual quick fixes with String.force_encoding and String.encode, I decided to ask the friendly people of SO.
To be honest, I was secretly hoping for String.encoding to automagically return the encoding used in the string, but it always reports ASCII-8BIT after I send a test e-mail to it. I started having this problem when I was implementing quoted-printable as an option, which seemed to work as long as I had also received some ;charset=blabla info.
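To illustrate why String.encoding is no help here, the following is a minimal sketch (the helper name decode_qp and the sample body are made up for illustration). unpack("M*") hands back raw bytes tagged as binary (ASCII-8BIT); Ruby does not inspect the bytes to guess a charset. force_encoding only relabels those bytes, while encode actually transcodes them.

```ruby
# Hypothetical helper: normalize line endings and decode quoted-printable.
def decode_qp(input)
  input.gsub(/\r\n/, "\n").unpack("M*").first
end

raw     = "caf=E9"        # "café" in ISO-8859-1, quoted-printable encoded
decoded = decode_qp(raw)

# The result carries the binary (ASCII-8BIT) tag: no charset is attached.
puts decoded.encoding
puts decoded.bytes.inspect

# Transcoding straight from binary fails, because Ruby has no idea
# which character the byte 0xE9 is supposed to be.
begin
  decoded.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  puts "cannot transcode: #{e.message}"
end

# force_encoding relabels the bytes (no bytes change), encode then converts.
latin1 = decoded.dup.force_encoding("ISO-8859-1")
utf8   = latin1.encode("UTF-8")
puts utf8.bytes.inspect
```

So the charset has to come from somewhere outside the string: the Content-Type header, a detector, or a guess.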
# Normalize line endings, then decode quoted-printable (result is tagged ASCII-8BIT):
input = input.gsub(/\r\n/, "\n").unpack("M*").first
if charset
  return input.force_encoding(charset).encode("utf-8")
end
# This is obviously wrong, as the string is not always ISO-8859-1 encoded:
return input.force_encoding("ISO-8859-1").encode("utf-8")
I've been experimenting with several "solutions" I found on the internet, but most of them relate to file reading/writing. I also tried a few gems for detecting encoding, but none really did the trick, or they were badly outdated. It should be possible, and it feels as if the answer is staring me right in the face. Hopefully someone here can shed some light on my situation and tell me what I've been doing wrong.
- Using Ruby 1.9.3
You may use https://github.com/janx/chardet to detect the original encoding of your email text.
Example Here:
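Whatever detector is used, its result still has to be applied with force_encoding and then transcoded with encode. As a rough, gem-free sketch of that step (this is not the chardet API; the method name to_utf8 and the candidate list are assumptions), one can try the declared charset first, then a few likely encodings, and keep the first one that validates:

```ruby
# Hypothetical helper: convert raw bytes to UTF-8, guessing when no usable
# charset was given. The candidate order is an assumption; adjust it to the
# mail you actually receive.
def to_utf8(bytes, charset = nil)
  candidates = [charset, "UTF-8", "Windows-1252", "ISO-8859-1"].compact

  candidates.each do |name|
    begin
      str = bytes.dup.force_encoding(name)   # relabel only, no bytes change
      next unless str.valid_encoding?        # e.g. rejects Latin-1 bytes as UTF-8
      return str.encode("UTF-8")             # real transcoding happens here
    rescue ArgumentError, EncodingError
      # Unknown charset name, or a byte with no mapping: try the next one.
      next
    end
  end

  # Last resort: keep what can be kept and replace everything else.
  bytes.dup.force_encoding("BINARY").encode("UTF-8", invalid: :replace, undef: :replace)
end

body = "caf=E9".unpack("M*").first
puts to_utf8(body, "blabla")   # bogus charset is skipped, fallbacks are tried
```

This only works as a heuristic: valid UTF-8 is rarely anything else, but single-byte encodings such as ISO-8859-1 and Windows-1252 accept almost any byte sequence, so they cannot be told apart this way. That is where a statistical detector like chardet is worth having.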