mime body guess charset (and convert to UTF-8)

Question

537 views Asked by CharlesLeaf At 27 May 2012 at 20:55

I'm trying to parse incoming e-mails and want to store the body as a UTF-8 encoded string in a database. However, I quickly noticed that not all e-mails include charset information in the Content-Type header. After trying some manual quick fixes with String#force_encoding and String#encode, I decided to ask the friendly people of SO.

To be honest, I was secretly hoping that String#encoding would automagically return the encoding used in the string, but it always reports ASCII-8BIT when I send a test e-mail to it. The problem started when I implemented quoted-printable decoding as an option, which seemed to work as long as I had also received some ;charset=blabla info.

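# Decode a quoted-printable body and convert it to UTF-8.
# (decode_body is an illustrative wrapper name so the returns are valid.)
def decode_body(input, charset = nil)
  # Normalize CRLF to LF, then decode quoted-printable (the "M" directive).
  input = input.gsub(/\r\n/, "\n").unpack("M*").first
  if charset
    return input.force_encoding(charset).encode("utf-8")
  end

  # This is obviously wrong as the string is not always ISO-8859-1 encoded:
  return input.force_encoding("ISO-8859-1").encode("utf-8")
end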

I've been experimenting with several "solutions" I found on the internet, but most seemed to relate to file reading/writing. I also tried a few gems for detecting encoding, but none really did the trick, or they were incredibly outdated. It should be possible, and it feels as if the answer is staring me right in the face. Hopefully someone here can shed some light on my situation and tell me what I've been doing completely wrong.

Using Ruby 1.9.3.
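
A note on the ASCII-8BIT observation in the question: String#encoding only reports the label attached to a string and never inspects the bytes, and unpack("M*") produces a binary (ASCII-8BIT) string. The minimal sketch below uses a made-up sample input to illustrate this. It also shows why a validity check can rule an encoding out but cannot prove which one the sender intended.

```ruby
# Made-up quoted-printable sample: "cafe" with an acute e, encoded as UTF-8.
raw = "caf=C3=A9".unpack("M*").first

# The decoded string carries no charset information, only a binary label.
label = raw.encoding                   # Encoding::ASCII_8BIT (binary)

# force_encoding relabels the same bytes without changing them.
as_utf8   = raw.dup.force_encoding("UTF-8")
as_latin1 = raw.dup.force_encoding("ISO-8859-1")

utf8_ok   = as_utf8.valid_encoding?    # the bytes form valid UTF-8
latin1_ok = as_latin1.valid_encoding?  # any byte sequence is valid ISO-8859-1

# encode really transcodes, so a wrong Latin-1 guess silently yields mojibake.
mojibake = as_latin1.encode("UTF-8")
```

Because both checks pass for this input, validity alone cannot pick the right charset; it only helps to discard candidates that are impossible.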

There are 2 answers

red-o-alf On 09 March 2013 at 19:12

Have you tried https://github.com/fac/cmess ?

== DESCRIPTION

CMess bundles several tools under its hood that aim at dealing with various problems occurring in the context of character sets and encodings. Currently, there are:

guess_encoding:: Simple helper to identify the encoding of a given string. Includes the ability to automatically detect the encoding of an input.

[...]

Hooopo On 29 May 2012 at 09:54 (Accepted Answer)

You can use https://github.com/janx/chardet to detect the original encoding of your email text.

Example:

irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'UniversalDetector'
=> false
irb(main):003:0> p UniversalDetector::chardet('hello')
{"encoding"=>"ascii", "confidence"=>1.0}
=> nil
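
Building on the detector shown in the irb session, a fallback chain for the question's decoding step might look like the sketch below. It is an illustration under stated assumptions, not a definitive implementation: to_utf8 and its detector argument are hypothetical names, the detector is any callable that returns a charset name or nil (for example a lambda wrapping UniversalDetector::chardet(body)["encoding"]), and ISO-8859-1 as the last resort is simply the asker's original default.

```ruby
# Hypothetical helper: convert a decoded MIME body to UTF-8.
# charset  - value from the Content-Type header, or nil when absent
# detector - optional callable returning a charset name or nil (assumption)
def to_utf8(body, charset = nil, detector = nil)
  candidates = [charset, "UTF-8"]
  candidates << detector.call(body) if detector
  candidates << "ISO-8859-1" # last resort: every byte sequence is valid here

  candidates.compact.each do |name|
    begin
      labeled = body.dup.force_encoding(name)
      next unless labeled.valid_encoding?
      return labeled.encode("UTF-8")
    rescue ArgumentError, EncodingError
      next # unknown charset name or no converter: try the next candidate
    end
  end
end
```

The declared charset is tried first, then UTF-8 (a body with high bytes rarely validates as UTF-8 by accident), then the detector's guess. A guess from a statistical detector can still be wrong, so storing the original bytes alongside the converted text may be worth considering.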
