Email parsing test dataset

I am evaluating email parsing libraries for an Elixir/Erlang project and am trying to figure out which one is "best", or whether I should build my own. The criterion I am using for "best" is: which library is the most RFC compliant.

The problem I am facing is that (unsurprisingly) each library has its own tests, so if I want an apples-to-apples comparison I need to run them all against the same tests.

Is there a collection of test emails available that I can use for evaluation? Or am I better off copying tests from a more active Java/Ruby/Python library?

There are 2 answers

jstedfast On

I have a collection of mbox files that I use for testing MIME parsers.

https://github.com/jstedfast/MimeKit/tree/master/UnitTests/TestData/mbox

That link is a directory containing a few *.mbox.txt files and their matching summary files (each summary is just some metadata about each message that should be easy to get from the message once the parser has parsed it from the mbox).
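If it helps, here is a rough sketch of producing that kind of per-message summary with Python's standard library, so you have a baseline to diff other parsers against. The field names (subject, content_type, part_count) are my own picks for illustration and are not the format of the MimeKit summary files; summarize_mbox is a hypothetical helper.

```python
import mailbox
from email.message import Message


def summarize_message(msg: Message) -> dict:
    """Collect a few fields that any MIME parser should be able to report.

    The chosen fields are illustrative assumptions, not the MimeKit
    summary format.
    """
    return {
        "subject": msg.get("Subject"),
        "content_type": msg.get_content_type(),
        # walk() yields the message itself plus every nested part.
        "part_count": sum(1 for _ in msg.walk()),
    }


def summarize_mbox(path: str) -> list:
    """Parse every message in an mbox file and summarize each one."""
    box = mailbox.mbox(path, create=False)
    try:
        return [summarize_message(msg) for msg in box]
    finally:
        box.close()
```

Running summarize_mbox over a downloaded copy of one of the *.mbox.txt files and writing the same fields out from each library under test would give you something you can compare line by line.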

There are also some *.html files, which are just the extracted HTML message bodies used for testing the logic that figures out which body part is the actual message body. You can probably ignore those as they're not really about rfc-compliance.

The main mbox to look at and use is the jwz.mbox.txt file - that's the mbox file I got from Jamie Zawinski of Netscape Mail fame back in the early 2000s for testing Netscape Mail's parser.

simple.mbox.txt is a very short mbox of 3 messages with nested multiparts using different sets of boundary markers. The second and third messages are the two most likely to break parsers (the first might break random MIME parsers written by newbies on sourceforge or github, but nothing seriously written). The second message has all nested multiparts using boundary="x", which will break parsers that don't use a boundary stack. The third message has nested multiparts that all use an empty string boundary (e.g. boundary="").
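To get a feel for the shared-boundary case before pulling down the real file, something like the following probe can be adapted. The message text is a made-up example of mine, not the content of simple.mbox.txt, and dump_mime_tree is a hypothetical helper that only borrows its name from the idea of dumping the MIME tree. It uses Python's stdlib parser just to show the shape of the probe; what matters is the tree each parser under test reports for the same input.

```python
from email import message_from_string
from email.message import Message

# Hypothetical test message: the outer multipart/mixed and the nested
# multipart/alternative deliberately share the same boundary string "x".
RAW = "\n".join([
    "Subject: nested multiparts sharing one boundary",
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="x"',
    "",
    "--x",
    'Content-Type: multipart/alternative; boundary="x"',
    "",
    "--x",
    "Content-Type: text/plain",
    "",
    "inner plain text",
    "--x",
    "Content-Type: text/html",
    "",
    "<p>inner html</p>",
    "--x--",
    "--x",
    "Content-Type: text/plain",
    "",
    "outer attachment",
    "--x--",
    "",
])


def dump_mime_tree(msg: Message, depth: int = 0) -> list:
    """Return one indented content-type line per node of the parsed tree."""
    lines = ["  " * depth + msg.get_content_type()]
    if msg.is_multipart():
        for child in msg.get_payload():
            lines.extend(dump_mime_tree(child, depth + 1))
    return lines


tree = dump_mime_tree(message_from_string(RAW))
print("\n".join(tree))
```

Diffing this kind of dump across libraries makes it obvious when one of them attaches the inner parts to the wrong parent or drops them entirely.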

Then there's a content-length.mbox.txt for testing that the parser properly handles Content-Length headers.

unmunged.mbox.txt looks like it was accidentally committed - it looks like I wrote that to see what Thunderbird did with Content-Length headers and unmunged From lines.

Anyway, to see how I generated the output for the summary files, you can check out https://github.com/jstedfast/MimeKit/blob/master/UnitTests/MimeParserTests.cs#L624

Methods like DumpMimeTree, etc. are all defined above that method in the file.

I've got a very similar test suite for my C MIME parser as well (if you'd rather read C than C#): https://github.com/jstedfast/gmime/blob/master/tests/test-parser.c

Additional Thoughts:

One thing to keep in mind when evaluating MIME parsers is that you don't really want strict rfc-compliance in parsing because that means that a lot of messages will fail to parse. What you really want is a library that will handle as much brokenness as possible while outputting new messages that strictly conform to the rfcs (as much as possible anyway).

While those mbox files should be helpful in making sure that the parsers you test are at least robust enough to handle those, that's not necessarily the end-all of testing.

One of the next things I do when evaluating a MIME parser is to check how the parser parses address headers. Does it do something stupid like splitting the header value on commas? If so, it's out. I would say it had better use a tokenizer approach or it's probably not even worth considering.
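A quick illustration of why splitting on commas fails, using Python's stdlib as a stand-in for a tokenizing parser. The addresses are invented for the example.

```python
from email.utils import getaddresses

# A display name containing a comma is legal as long as it is quoted.
header = '"Doe, John" <john@example.com>, jane@example.com'

# Naive approach: split on commas. This cuts the quoted display name in half.
naive = [piece.strip() for piece in header.split(",")]

# Tokenizing approach: quoted strings are respected.
parsed = getaddresses([header])

print(naive)   # three fragments, two of them not valid addresses
print(parsed)  # two (display name, address) pairs
```

The same header is a handy one-line smoke test for any library you are evaluating: if it reports three recipients, it is splitting on commas.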

The same goes for rfc2047 decoding.
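One small rfc2047 check that catches a lot of parsers: whitespace between two adjacent encoded-words must be dropped when decoding, so a long header split into several encoded-words joins back up cleanly. Below is a minimal sketch with Python's stdlib; the header value is my own example and decode_rfc2047 is a hypothetical wrapper name.

```python
from email.header import decode_header, make_header


def decode_rfc2047(raw: str) -> str:
    """Decode a header value that may contain RFC 2047 encoded-words."""
    return str(make_header(decode_header(raw)))


# Two adjacent encoded-words separated by a single space. The space between
# them is not part of the text; the "_" inside the second word is.
raw = "=?utf-8?q?caf=C3=A9?= =?utf-8?q?_au_lait?="
decoded = decode_rfc2047(raw)
print(decoded)  # café au lait
```

A parser that decodes each encoded-word on its own and keeps the separating whitespace would produce two spaces in the middle here.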

Here's a rant I wrote back in 2013 when I was in your position looking for a reasonably good MIME parser for C#/.NET: https://jeffreystedfast.blogspot.com/2013/09/time-for-rant-on-mime-parsers.html

This links back to an earlier post I had written which is a rant about why decoding headers (rfc2047) is hard to get right: https://jeffreystedfast.blogspot.com/2013/08/why-decoding-rfc2047-encoded-headers-is.html

I guess the problem with trying to evaluate a MIME parser/email library is that you kind of need to be intimately familiar with the specifications in order to have much confidence in trying to evaluate them beyond the simple "can it parse my random set of messages?"

I hope that this has been helpful, but... yea, if your experience is anything like mine was back in 2013 looking for a decent C# parser, you're going to need to write your own - just please, please, please read and follow the specs if you do because otherwise you just end up giving other email devs nightmares.

Marcos Tapajós On

I don't think you will find any complete test suite for e-mail parsing in Elixir, but it would be a very nice project to work on.

If I were going to start a project like that, I would probably pick the tests from an existing library, evaluate how complete they are (based on the RFC), and build a generic way to run them against any library.

DockYard/elixir-mail/blob/master/test/mail/parsers/rfc_2822_test.exs could be a good starting point for you.