The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message, with references to the attached file and how it was encoded. This allows the user to Show Original, but does not require me to actually store the less efficient base64 with the message. Most of the time I will be able to store just the base64 line length that was used.
This way, we can perform attachment-level deduplication.
But how can the deduplication go further? Here are my thoughts:
- All attachments and emails could be compressed (byte-level deduplicated) individually of course.
- I could compress sets of maybe 12 attachments together in a single file. Compressing multiple files of the same type (for example, PDFs), even those from the same sender, may be more effective.
- The MIME messages can also be compressed in sets.
- I am not concerned about search efficiency because there will be full text indexing used.
- Searching of the emails would of course use a type of full text indexing, that would not be compressed.
- Decompressed cache would be created as the email first arrives, and would only be deleted after the email is not viewed for a time.
Do you have any advice in this area? What is normal for an email storage system?
