Handling international characters in email subject lines with Python 3


I’m writing a script to read the subject lines on unread emails. My first attempt:

from imaplib import IMAP4_SSL
from email.parser import HeaderParser

# username = 
# password = 
# server = 
# port = 

M = IMAP4_SSL(server, port)
M.login(username, password)
M.select()
typ, data = M.search(None, '(UNSEEN)')

for num in data[0].split():
    rv, data = M.fetch(num, '(BODY.PEEK[HEADER.FIELDS (SUBJECT FROM)])')
    header_data = data[0][1].decode('utf-8')
    parser = HeaderParser()
    msg = parser.parsestr(header_data)
    subject = msg['Subject']
    print(subject)
    print()

This works for most emails, but it fails when there's a non-ASCII character in the subject line. The output looks like:

=?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=

So it looks like HeaderParser doesn't decode encoded words (specified in RFC 1342, since superseded by RFC 2047). Looking at the documentation, it seemed like I needed to use decode_header and make_header. My second attempt:

# same setup code as before

from email.header import decode_header, make_header

for num in data[0].split():
    rv, data = M.fetch(num, '(BODY.PEEK[HEADER.FIELDS (SUBJECT FROM)])')
    headers_encoded = data[0][1].decode('latin-1')
    #print(headers_encoded)
    header_code_pairs = decode_header(headers_encoded)
    #print(header_code_pairs)
    headers = make_header(header_code_pairs)
    parser = HeaderParser()
    msg = parser.parsestr(str(headers))
    subject = msg['Subject']
    print(subject)
    print()

And the output looks like this:

This email has ASCII only

This email has internatiònal characters From: Tester Testee <[email protected]>

For some reason it's concatenating the From field onto the second subject. But it does decode the characters correctly! Both emails have the headers in the same order. When I uncomment the prints of headers_encoded and header_code_pairs instead, I get this:

Subject: This email has ASCII only
From: Tester Testee <[email protected]>

[('Subject: This email has ASCII only\r\nFrom: Tester Testee <[email protected]>\r\n\r\n', None)]

Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=
From: Tester Testee <[email protected]>

[(b'Subject: ', None), (b'This email has internati\xc3\xb2nal characters', 'utf-8'), (b'From: Tester Testee <[email protected]>', None)]

So to me, it looks like the problem is that in the international example, decode_header drops the CRLF between the fields. When make_header then reads the result, it sees only one field.

I can work around this by separating the lines of the header before decoding, but am I missing something? Is there a better way?
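One way to do that per-field separation is to let HeaderParser split the block into fields first and only then decode each value. This is a minimal sketch rather than a definitive fix: the `decoded_header` helper and the `tester@example.com` address are made up for illustration, and `raw` stands in for `data[0][1]`.

```python
from email import policy
from email.header import decode_header, make_header
from email.parser import HeaderParser

# Stand-in for data[0][1]; the address is a made-up placeholder.
raw = (
    b"Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=\r\n"
    b"From: Tester Testee <tester@example.com>\r\n"
    b"\r\n"
)
text = raw.decode("ascii", errors="replace")


def decoded_header(msg, name):
    """Hypothetical helper: decode one already-separated header value."""
    value = msg[name]
    if value is None:
        return None
    return str(make_header(decode_header(value)))


# Parse first so each field is isolated, then decode field by field.
msg = HeaderParser().parsestr(text)
subject = decoded_header(msg, "Subject")
sender = decoded_header(msg, "From")
print(subject)
print(sender)

# Alternative: with policy.default the parser decodes encoded words on access.
modern_msg = HeaderParser(policy=policy.default).parsestr(text)
modern_subject = str(modern_msg["Subject"])
print(modern_subject)
```

Because decode_header only ever sees a single header value here, there is no line break for it to discard, so the From field can't leak into the subject. The `policy.default` variant is shown only as a possible shortcut; whether it fits depends on how the rest of the script treats the message object.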

None of the answers to this old question solved the problem in my example, and my code using make_header produces a different error, so I'm posting this as a separate question. If you want to reproduce the error without using a real mailbox, you should be able to paste the following block into a text editor, save it, and load that file in place of data[0][1]:

Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=
From: Tester Testee <[email protected]>
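For anyone who'd rather not create a file, the same reproduction can be sketched with the block inlined as bytes. The names `sample` and `merged` and the `tester@example.com` address are placeholders I made up for this example; the rest mirrors my second attempt.

```python
from email.header import decode_header, make_header

# Inline copy of the block above; the address is a made-up placeholder.
sample = (
    b"Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=\r\n"
    b"From: Tester Testee <tester@example.com>\r\n"
    b"\r\n"
)

# Same steps as the second attempt: decode the whole block at once.
headers_encoded = sample.decode("latin-1")
header_code_pairs = decode_header(headers_encoded)
merged = str(make_header(header_code_pairs))

print(header_code_pairs)
print(repr(merged))  # both fields end up on one line, with no CRLF between them
```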