email.header not handling Finnish characters?

A certain Python API returns u'J\xe4rvenp\xe4\xe4' for the Finnish word Järvenpää.

where \xe4 == ä

I then call email.header to add this field to a header that will be printed.

email.header falls over when it tries to decode the umlaut:

  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/email/header.py", line 73, in decode_header
    header = str(header)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)

I've tried a couple of things:

  • Adding # -*- coding: utf-8 -*- to the top of header.py
  • Calling unicode() on the Finnish string before passing it to email.header
  • Calling .encode('utf-8') on the Finnish string before passing it to email.header

None of these solved the problem. What am I doing wrong? I'd imagine that a solution won't involve modifying header.py (a core Python module).

Python version: 2.7.10

UPDATE:

Header() is not being instantiated directly. Rather, I'm calling the decode_header() function on the string:

email.Header.decode_header(theString)

It now seems that simply extending the call like this:

email.Header.decode_header(theString.encode('utf-8'))

solves the problem.
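
For context, decode_header() parses header values that are already RFC 2047-encoded, so a typical flow is to encode with Header first and decode afterwards. The following is a minimal sketch of that round trip, not the asker's actual code; the variable names are illustrative, and it is written so that it should run on both Python 2.7 and Python 3.

```python
from email.header import Header, decode_header

# The unicode string returned by the (unnamed) API in the question.
name = u'J\xe4rvenp\xe4\xe4'

# Encode the unicode text into an RFC 2047 header value (ASCII-only).
encoded = Header(name, 'utf-8').encode()

# decode_header() returns a list of (value, charset) pairs.
parts = decode_header(encoded)

# Turn each part back into unicode; parts that were never encoded
# have charset None and are treated as ASCII here (an assumption).
decoded = u''.join(
    raw.decode(charset or 'ascii') if isinstance(raw, bytes) else raw
    for raw, charset in parts
)
```

Here decoded compares equal to name, and encoded contains only ASCII characters, so it is safe to print or place in a message.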

There are 2 answers

Klaus D. (best answer)

In order to have the email.header module handle encoding for you and create a proper header, you have to create an instance of email.header.Header with your string and the charset it should be encoded in:

>>> h = Header(text, charset)

For example:

>>> t = u'J\xe4rvenp\xe4\xe4'
>>> print t
Järvenpää
>>> from email.header import Header
>>> h = Header(t, 'utf-8')
>>> h
<email.header.Header instance at 0x7fc2636e7950>
>>> print h
=?utf-8?b?SsOkcnZlbnDDpMOk?=
>>> h = Header(t, 'iso-8859-1')
>>> print h
=?iso-8859-1?q?J=E4rvenp=E4=E4?=

The string can be either a unicode string or a byte string.

  • If you use a unicode string, the charset will only affect what encoding the header is encoded with.
  • If you use a byte string, the charset will both determine what encoding the byte string is assumed to be in, and what encoding will be used to encode the header. If the byte string you provide can't be decoded with that charset, an exception will be raised.
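
The two cases above can be sketched as follows. This is an illustrative example under the assumption that the byte string is ISO-8859-1 encoded; the variable names are made up for the sketch, and it should run on both Python 2.7 and Python 3.

```python
from email.header import Header

text = u'J\xe4rvenp\xe4\xe4'
# The same word as an ISO-8859-1 byte string (assumed input encoding).
raw_latin1 = text.encode('iso-8859-1')

# Unicode input: the charset only selects the output encoding.
from_text = Header(text, 'iso-8859-1').encode()

# Byte-string input: the charset is also used to decode the bytes.
from_bytes = Header(raw_latin1, 'iso-8859-1').encode()

# Bytes that are not valid in the given charset raise an exception.
try:
    Header(raw_latin1, 'utf-8')
    mismatch_error = None
except UnicodeError as exc:
    mismatch_error = exc
```

Both from_text and from_bytes hold the same encoded header, while mismatch_error holds the exception raised because the ISO-8859-1 bytes are not valid UTF-8.
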

Alex Ivanov

AFAIK, str() uses the ASCII codec by default, which is why you get the error. If your string is already unicode, you should use header = unicode(header); if it is a byte string, it should be decoded first.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Decode the UTF-8 byte string literal into a unicode object (Python 2).
header = "Järvenpää".decode('UTF-8')
print header

Output

Järvenpää
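
The failure in the traceback can be reproduced without email.header at all, since in Python 2 str() on a unicode object implicitly encodes with the ASCII codec. Below is a small sketch of that, using an explicit .encode('ascii') as a stand-in for the implicit conversion so that it also runs on Python 3; the variable names are illustrative.

```python
name = u'J\xe4rvenp\xe4\xe4'

# Explicit stand-in for what Python 2's str(name) does implicitly.
try:
    name.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Choosing a codec that can represent the character avoids the error.
utf8_bytes = name.encode('utf-8')
round_tripped = utf8_bytes.decode('utf-8')
```

ascii_error.start gives the index of the first character that could not be encoded, which corresponds to the "position 1" reported in the question's traceback.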