How to reconstruct conversations or group emails?


I have a PST file that contains the email history of a user. The task is to read this PST file and reconstruct the email history for display in a client. This includes correctly displaying conversations, as you know them from email clients:

Meeting at 8:00               07:34 am
  AW: Meeting at 8:00         09:12 am
    AW: AW: Meeting at 8:00   01:45 pm
[Jenkins Build] Success       11:54 am
  [Jenkins Build] Failed      12:13 pm
    [Jenkins Build] Success   01:12 pm
[Jenkins Build] Success       10:34 am
  [Jenkins Build] Failed      12:12 pm
    [Jenkins Build] Success   05:12 pm

However, I don't know how to do this reliably.

I am using java-libpst (see Official Documentation), which provides a PSTMessage object. It has a method getConversationId(), but that appears to return just the original subject of the message as a string, which means there can be duplicates (e.g. [Jenkins Build]*).

So, I am not sure how Outlook reconstructs conversations, or whether this is trivial. If there is a simple method that I am just overlooking, I'd be happy if somebody let me know. Otherwise I will end up parsing a ton of subject fields and trying to match emails by subject, with the danger of failing to tell apart different conversations that coincidentally share the same subject.


There is 1 answer

Carl G On

I think you will need to construct the conversations yourself. You might find the source code referenced on this page about the Netscape Mail message threading algorithm helpful.

I copied the source code to GitHub. Here's the email Threader.java file.

Here is someone offering an explanation of how Gmail constructs conversations. My gist is:

  1. Emails coming after an email with an equivalent subject, from any of the participants in any previous email, are part of the same conversation.
  2. The in-reply-to email field can add participants to an email conversation even if they weren't explicit participants.

Where:

equivalent subject: either an identical subject, or a subject that would result from replying or forwarding, i.e. "FW: X", "RE: X", "Fwd: X", etc.

explicit participants in an email: the sender or any email address appearing in a TO: or CC: field. (Maybe a BCC: field too...)

participants in an email: the explicit participants in an email, plus anyone who has sent a later email that refers to it via the in-reply-to field.

participants in any previous email: the distinct email addresses that are participants in emails with an earlier send date and a subject equivalent to the current email's.
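
To make rule 1 concrete, here is a minimal sketch in Java of subject-and-participant grouping. It is only my reading of the heuristic above, not Gmail's actual implementation. The Mail and Conversation classes are hypothetical stand-ins that you would fill from your PST reader, and the list of reply/forward prefixes (RE, AW, FW, FWD, WG) is an assumption you will probably need to extend. Rule 2 (the in-reply-to field) is not handled here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class SubjectConversations {

    /** Hypothetical minimal mail model; populate it from whatever your PST reader returns. */
    static final class Mail {
        final String subject;
        final long sentMillis;
        final String sender;
        final Set<String> explicitParticipants = new HashSet<>(); // sender plus TO/CC

        Mail(String subject, long sentMillis, String sender, String... recipients) {
            this.subject = subject;
            this.sentMillis = sentMillis;
            this.sender = sender.toLowerCase(Locale.ROOT);
            explicitParticipants.add(this.sender);
            for (String r : recipients) {
                explicitParticipants.add(r.toLowerCase(Locale.ROOT));
            }
        }
    }

    /** A conversation: its normalized subject, its mails in send order, and everyone seen so far. */
    static final class Conversation {
        final String subjectKey;
        final List<Mail> mails = new ArrayList<>();
        final Set<String> participants = new HashSet<>();

        Conversation(String subjectKey) {
            this.subjectKey = subjectKey;
        }
    }

    // One or more leading reply/forward markers. The prefix list is an assumption.
    private static final Pattern PREFIXES =
            Pattern.compile("^(?:\\s*(?:re|aw|fw|fwd|wg)\\s*:\\s*)+", Pattern.CASE_INSENSITIVE);

    /** Maps "AW: AW: X", "Fwd: X" and "X" to the same lower-cased key. */
    static String normalizeSubject(String subject) {
        if (subject == null) {
            return "";
        }
        return PREFIXES.matcher(subject).replaceFirst("").trim().toLowerCase(Locale.ROOT);
    }

    /** Groups mails by equivalent subject, requiring the sender to be a known participant. */
    static List<Conversation> group(List<Mail> mails) {
        List<Mail> sorted = new ArrayList<>(mails);
        sorted.sort(Comparator.comparingLong((Mail m) -> m.sentMillis));

        List<Conversation> conversations = new ArrayList<>();
        for (Mail mail : sorted) {
            String key = normalizeSubject(mail.subject);
            Conversation target = null;
            for (Conversation c : conversations) {
                if (c.subjectKey.equals(key) && c.participants.contains(mail.sender)) {
                    target = c; // if several match, the most recently started one wins
                }
            }
            if (target == null) {
                target = new Conversation(key);
                conversations.add(target);
            }
            target.mails.add(mail);
            target.participants.addAll(mail.explicitParticipants);
        }
        return conversations;
    }
}
```

Note that this heuristic cannot separate automated mails that share sender and subject, such as the two [Jenkins Build] threads in the question; they would be merged per subject.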

Here's another exposition of the email fields relevant to email threading. What I took from this is that the References header should be consulted in addition to the in-reply-to header, and that it is more reliable. (Maybe, if present, it should supersede the in-reply-to header.)
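
As a rough illustration of header-based threading, the following Java sketch links each message to a parent by taking the last Message-ID in its References header and falling back to in-reply-to when References is empty. It is a much simplified take on the idea, not the full Netscape threading algorithm (there are no placeholder containers for missing messages and no subject fallback). The Node class and the way the header values are passed in are hypothetical; how you obtain them from a PSTMessage is left open.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderThreads {

    /** Hypothetical tree node holding one message and the header values needed for linking. */
    static final class Node {
        final String messageId;
        final String subject;
        final String parentId; // derived from References, else In-Reply-To; may be null
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String messageId, String subject, String references, String inReplyTo) {
            this.messageId = messageId;
            this.subject = subject;
            this.parentId = pickParentId(references, inReplyTo);
        }
    }

    // A Message-ID token in angle brackets, e.g. <abc@example.com>.
    private static final Pattern MSG_ID = Pattern.compile("<[^<>\\s]+>");

    /** Extracts all Message-ID tokens from a raw header value, in order. */
    static List<String> parseIds(String headerValue) {
        List<String> ids = new ArrayList<>();
        if (headerValue == null) {
            return ids;
        }
        Matcher m = MSG_ID.matcher(headerValue);
        while (m.find()) {
            ids.add(m.group());
        }
        return ids;
    }

    /** Prefers the last References entry (the direct parent), else the In-Reply-To id. */
    static String pickParentId(String references, String inReplyTo) {
        List<String> refs = parseIds(references);
        if (!refs.isEmpty()) {
            return refs.get(refs.size() - 1);
        }
        List<String> irt = parseIds(inReplyTo);
        return irt.isEmpty() ? null : irt.get(0);
    }

    /** Links nodes into trees and returns the roots (messages without a known parent). */
    static List<Node> buildThreads(List<Node> nodes) {
        Map<String, Node> byId = new LinkedHashMap<>();
        for (Node n : nodes) {
            byId.putIfAbsent(n.messageId, n); // on duplicate ids, keep the first one
        }
        List<Node> roots = new ArrayList<>();
        for (Node n : byId.values()) {
            Node parent = n.parentId == null ? null : byId.get(n.parentId);
            if (parent == null || parent == n || isAncestor(n, parent)) {
                roots.add(n); // unknown parent, self-reference, or would create a cycle
            } else {
                n.parent = parent;
                parent.children.add(n);
            }
        }
        return roots;
    }

    /** True if 'candidate' already sits above 'start' in the tree built so far. */
    private static boolean isAncestor(Node candidate, Node start) {
        for (Node cur = start; cur != null; cur = cur.parent) {
            if (cur == candidate) {
                return true;
            }
        }
        return false;
    }
}
```

Because linking relies on Message-IDs rather than subjects, two unrelated mails with the same subject end up as separate roots, provided the headers are present. Messages without usable headers would still need a subject-based fallback.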