I have recently run into an EML file I wanted to parse with Python email module.
In from
header, there was following text:
From: "=?utf-8?b?5b2t5Lul5Zu9L+esrOS6jOS6i+S4mumDqOmhueebrumDqC/nrKzkuozkuovkuJrp?=
=?utf-8?b?g6g=?=" <[email protected]>
So the name is encoded in 2 parts. When I concatenate the code and decode this manually to hex, I get the following result, which is correct UTF-8 string:
e5 bd ad e4 bb a5 e5 9b bd 2f e7 ac ac e4 ba 8c e4 ba 8b e4 b8 9a e9 83 a8 e9 a1 b9 e7 9b ae e9 83 a8 2f e7 ac ac e4 ba 8c e4 ba 8b e4 b8 9a e9 83 a8
However, when I call the Python email Parser parse
, the last 3 bytes are not decoded correctly. Instead, when I read the values of message['from']
, there are surrogates:
dce9:20:dc83:dca8
So when I, for example, want to print the string, it ends up with
UnicodeEncodeError('utf-8', '彭以国/第二事业部项目部/第二事业\udce9\udc83\udca8', 17, 18, 'surrogates not allowed')
When I join the 2 encoded parts in From
header into one, which looks like this:
From: "=?utf-8?b?5b2t5Lul5Zu9L+esrOS6jOS6i+S4mumDqOmhueebrumDqC/nrKzkuozkuovkuJrpg6g=?=" <[email protected]>
The string is decoded correctly by the library and can be printed just fine.
Is this a bug inside Python email module? Is the double-encoded value even permitted by EML standard?
Here is a sample EML file + Python code to reproduce the bad decoding (this does not actually trigger the exception, which happens later i.e. with SQLAlchemy not being able to encode the string back to UTF-8)
EML:
Content-Type: multipart/mixed; boundary="===============2193163039290138103=="
MIME-Version: 1.0
Date: Wed, 25 Aug 2018 19:21:23 +0100
From: "=?utf-8?b?5b2t5Lul5Zu9L+esrOS6jOS6i+S4mumDqOmhueebrumDqC/nrKzkuozkuovkuJrp?=
=?utf-8?b?g6g=?=" <[email protected]>
Message-Id: <[email protected]>
Subject: Sample subject
To: [email protected]
--===============2193163039290138103==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
VGhpcyBpcyBhIHNhbXBsZSB0ZXh0
--===============2193163039290138103==--
Python code:
from email.parser import Parser
from email import policy
from sys import argv
with open(argv[1], 'r', encoding='utf-8') as eml_file:
msg = Parser(policy=policy.default).parse(eml_file)
print(msg['from'])
Result:
彭以国/第二事业部项目部/第二事业� ��
This appears to be a problem with how the
email.parser
infrastructure is handling unfolding of multi-line headers containing encoded-word tokens for the From header and other structured headers. It does this correctly for unstructured headers such asSubject
.Your header has two encoded word parts, on two separate lines. This is perfectly normal, an encoded-word token has limited space (there is a maximum length limit) and so your UTF-8 data was split into two such words, and there is a line-separator plus space in-between. All great and fine. Whatever generated the email was wrong to split in the middle of a UTF-8 character (RFC2047 states that is strictly forbidden), a decoder of such data should not insert spaces between the decoded bytes. It is the extra space that then prevents the
email
header handling from joining the surrogates and repairing the data.So this appears to be a bug in the way the headers are parsed when handling structured headers; the parser does not correctly handle spaces between encoded words, here the space was introduced by the folded header line. This then results in the space being preserved in between the two encoded-word parts, preventing proper decoding. So while RFC2047 does state that encoded-word sections MUST contain whole characters (multi-byte encodings must not be split), it also states that encoded words can be split up with CRLF SPACE delimiters and any spaces in between encoded words are to be ignored.
You can work around this by supplying a custom policy class, which removes the leading white space from lines in your own implementation of the
Policy.header_fetch_parse()
method.and use that as your policy when loading:
Demo:
I filed Python issue #35547 to track this.