I’m writing a script to read the subject lines on unread emails. My first attempt:
from imaplib import IMAP4_SSL
from email.parser import HeaderParser

# username =
# password =
# server =
# port =

M = IMAP4_SSL(server, port)
M.login(username, password)
M.select()
typ, data = M.search(None, '(UNSEEN)')
for num in data[0].split():
    rv, data = M.fetch(num, '(BODY.PEEK[HEADER.FIELDS (SUBJECT FROM)])')
    header_data = data[0][1].decode('utf-8')
    parser = HeaderParser()
    msg = parser.parsestr(header_data)
    subject = msg['Subject']
    print(subject)
    print()
This works for most emails, but it fails when there's a non-ASCII character in the subject line. The output looks like:
=?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=
So it looks like HeaderParser doesn't decode encoded words (specified in RFC 1342, since superseded by RFC 2047). Looking at the documentation, it seemed like I needed to use decode_header and make_header. My second attempt:
# same setup code as before
from email.header import decode_header, make_header

for num in data[0].split():
    rv, data = M.fetch(num, '(BODY.PEEK[HEADER.FIELDS (SUBJECT FROM)])')
    headers_encoded = data[0][1].decode('latin-1')
    #print(headers_encoded)
    header_code_pairs = decode_header(headers_encoded)
    #print(header_code_pairs)
    headers = make_header(header_code_pairs)
    parser = HeaderParser()
    msg = parser.parsestr(str(headers))
    subject = msg['Subject']
    print(subject)
    print()
And the output looks like this:
This email has ASCII only
This email has internatiònal characters From: Tester Testee <[email protected]>
For some reason it's concatenating the From field onto the second one. But it does decode the characters correctly! Both emails have the headers in the same order. When I uncomment the headers_encoded and header_code_pairs prints instead, I get this:
Subject: This email has ASCII only From: Tester Testee
[('Subject: This email has ASCII only\r\nFrom: Tester Testee <[email protected]>\r\n\r\n', None)]
Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?= From: Tester Testee
[(b'Subject: ', None), (b'This email has internati\xc3\xb2nal characters', 'utf-8'), (b'From: Tester Testee <[email protected]>', None)]
So to me, this looks like the problem is being caused by the fact that in the international example, decode_header misses a CRLF between the fields. So when make_header reads it, it sees only one field.
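To check that reading without a mailbox, here is a minimal sketch that feeds the same two header lines straight into decode_header. The address tester@example.com is a placeholder I made up for the sketch (the real one is not shown above), and the names raw, pairs, and joined are just for illustration.

```python
from email.header import decode_header, make_header

# Stand-in for data[0][1]. The address is a made-up placeholder.
raw = (
    b"Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=\r\n"
    b"From: Tester Testee <tester@example.com>\r\n"
    b"\r\n"
)

# Same call as in my second attempt: the whole header block goes in at once.
pairs = decode_header(raw.decode("latin-1"))

# Each chunk comes back without its line ending, so the field boundary is gone.
for chunk, charset in pairs:
    print(repr(chunk), charset)

# make_header then joins the chunks into a single logical header.
joined = str(make_header(pairs))
print(repr(joined))
```

If this behaves the way my mailbox output suggests, none of the chunks contains a CR or LF, which would explain why HeaderParser later sees Subject and From as one field.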
I can work around this by separating the lines of the header before decoding, but am I missing something? Is there a better way?
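For concreteness, here is a rough sketch of two orderings that seem to sidestep the problem; I'm not sure either one is the intended usage. The first splits the block into fields with BytesHeaderParser and only then passes the Subject value through decode_header and make_header. The second hands the parser email.policy.default, which, as far as I understand it, decodes encoded words while parsing. The function names, the RAW constant, and the address in it are placeholders of mine.

```python
from email import policy
from email.header import decode_header, make_header
from email.parser import BytesHeaderParser

# Stand-in for data[0][1]. The address is a made-up placeholder.
RAW = (
    b"Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=\r\n"
    b"From: Tester Testee <tester@example.com>\r\n"
    b"\r\n"
)


def subject_parse_then_decode(raw_headers: bytes) -> str:
    """Split the block into fields first, then decode only the Subject value."""
    msg = BytesHeaderParser().parsebytes(raw_headers)
    raw_subject = msg.get("Subject", "")  # empty string if the field is absent
    return str(make_header(decode_header(raw_subject)))


def subject_with_default_policy(raw_headers: bytes) -> str:
    """Let email.policy.default decode encoded words during parsing."""
    msg = BytesHeaderParser(policy=policy.default).parsebytes(raw_headers)
    return str(msg.get("Subject", ""))


if __name__ == "__main__":
    print(subject_parse_then_decode(RAW))
    print(subject_with_default_policy(RAW))
```

In both versions the raw bytes from M.fetch go to the parser directly, so the decode('latin-1') step from my second attempt is not needed.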
None of the answers to this old question solved the problem with my example, so I'm posting it as my own question because I have code using make_header that produces a different error. If you want to reproduce the error without using a real mailbox, you should be able to paste the following block into a text editor and have it load that instead of data[0][1]:
Subject: =?UTF-8?Q?This_email_has_internati=C3=B2nal_characters?=
From: Tester Testee <[email protected]>