Deduplication Suggestions for Email Storage

Question

Asked by 700 Software on 25 January 2012 at 14:57 (326 views)

The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message, with references to the attached file and how it was encoded. This allows the user to Show Original, but does not require me to actually store the less efficient base64 with the message. Most of the time I will be able to store just the base64 line length that was used.

This way, we can perform attachment-level deduplication.
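
As a rough illustration of the storage model described above, the sketch below decodes a base64 body, records the line length, and only accepts the extraction if re-encoding reproduces the original bytes exactly. The function names `fold_base64` and `extract_part`, the parameter dictionary, and the CRLF-only assumption are all hypothetical choices made for this example, not part of any particular mail server.

```python
import base64


def fold_base64(raw: bytes, line_length: int, newline: bytes = b"\r\n") -> bytes:
    """Re-encode raw bytes as base64, folded at a fixed line length."""
    encoded = base64.b64encode(raw)
    lines = [encoded[i:i + line_length] for i in range(0, len(encoded), line_length)]
    return newline.join(lines) + newline


def extract_part(original_body: bytes):
    """Return (raw_bytes, params) if the body can be rebuilt byte for byte, else None.

    Assumption for this sketch: bodies use CRLF line endings and end with one CRLF.
    """
    lines = original_body.split(b"\r\n")
    if not lines[0]:
        return None
    line_length = len(lines[0])
    try:
        # Non-alphabet characters such as CR and LF are skipped by default.
        raw = base64.b64decode(original_body)
    except ValueError:
        return None
    if fold_base64(raw, line_length) != original_body:
        # Irregular folds or trailing characters: store this part verbatim instead.
        return None
    return raw, {"line_length": line_length, "newline": "CRLF"}


if __name__ == "__main__":
    body = fold_base64(b"example attachment bytes" * 20, 76)
    raw, params = extract_part(body)
    print(len(raw), params)
```

When `extract_part` returns `None`, the safest fallback is probably to keep that part in its original encoded form, so Show Original still returns the exact message.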

But how can the deduplication go further? Here are my thoughts:

- All attachments and emails could, of course, be compressed individually (byte-level deduplication).
- I could compress sets of perhaps 12 attachments together in a single file. Grouping files of the same type (for example, PDFs), or even files from the same sender, may compress more effectively.
- The MIME messages could also be compressed in sets.
- I am not concerned about search efficiency, because searching would use some form of full-text index, which would not be compressed.
- A decompressed cache copy would be created when an email first arrives and deleted only after the email has not been viewed for some time.
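
To make the idea of compressing attachments in sets more concrete, here is a small sketch that compares compressing each blob on its own against compressing the whole set as one stream. The sample data is synthetic (a shared, deliberately incompressible block plus a short unique suffix), and all function names are invented for the example. Real attachments will behave differently, so treat this as a way to measure, not as a prediction.

```python
import hashlib
import zlib


def compressed_size_individual(blobs):
    """Total size when every blob is compressed separately."""
    return sum(len(zlib.compress(blob, 9)) for blob in blobs)


def compressed_size_grouped(blobs):
    """Size when all blobs are concatenated and compressed as one stream."""
    return len(zlib.compress(b"".join(blobs), 9))


def make_sample_attachments(count=12):
    """Build synthetic blobs that share a 2048-byte block (hypothetical test data)."""
    shared = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(64))
    return [shared + f"unique part {n}".encode() for n in range(count)]


if __name__ == "__main__":
    blobs = make_sample_attachments()
    print("individual:", compressed_size_individual(blobs))
    print("grouped:   ", compressed_size_grouped(blobs))
```

Two caveats apply to this sketch. zlib only looks back 32 KiB, so it will not find repeats across large attachments; a compressor with a larger window would be needed for that. Grouping also means that reading one attachment requires decompressing the stream up to that point, which is where the decompressed cache would help.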

Do you have any advice in this area? What is normal for an email storage system?

There is 1 answer

**Filip Hanes** · Answer 1 · 2022-01-29T12:35:11+00:00

- Decode all base64 MIME parts, not only attachments.
- Calculate a secure hash of each part's decoded content.
- Replace the part with a reference in the email body, or add a custom header listing the extracted MIME parts.
- Store each part in blob storage under its secure hash (content-addressable storage).
- Use a reference counter for deletions and garbage collection, or a smarter double counter (https://docs.wildduck.email/#/in-depth/attachment-deduplication, https://medium.com/@andrewsumin/efficient-storage-how-we-went-down-from-50-pb-to-32-pb-99f9c61bf6b4).
- Alternatively, store each hash-to-email-ID reference relation in the database.
- Carefully check and control the base64 folds: some emails have a shorter line in the middle, and some have additional characters (dot, whitespace) at the end.
- Store the encoding parameters (folds, tail) in the reference in the email body so the original can be reconstructed exactly.
- Compress compressible attachments, but be careful with content-addressable storage, because compression changes the content hash.
- JPEG images can be compressed losslessly, often significantly, using JPEG XL or Lepton (https://github.com/dropbox/lepton).
- WAV files can be compressed using FLAC, and so on.
- The content type is specified by the sender, so the same attachment can arrive with different content types.
- Quoted-printable encoded parts are hard to decode and reconstruct exactly. There are many encoder parameters, because each encoder escapes different characters and folds lines differently.
- Be careful with the reference format, so that a malicious sender cannot craft an email containing a reference and fetch an attachment they do not own. Alternatively, detect and escape references in received emails.
- Small MIME parts may not be worth extracting until a certain number of duplicates are present in the system.
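
Several of the points above (hashing decoded content, content-addressable storage, tracking hash-to-email relations, and refusing references the email does not own) can be sketched together. This is an in-memory toy, not a production design: the `BlobStore` class, its method names, and the choice of SHA-256 are assumptions made for the example. It keys each blob on the hash of the uncompressed bytes, which is one way to keep the key stable when the stored form is compressed.

```python
import hashlib
import zlib
from typing import Dict, Set


class BlobStore:
    """Toy content-addressable store that tracks which emails reference each blob."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}      # hash -> compressed content
        self._owners: Dict[str, Set[str]] = {}  # hash -> ids of referencing emails

    def put(self, email_id: str, content: bytes) -> str:
        # Hash the decoded, uncompressed bytes so the key does not depend
        # on how the blob happens to be compressed at rest.
        key = hashlib.sha256(content).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = zlib.compress(content)
        self._owners.setdefault(key, set()).add(email_id)
        return key

    def get(self, email_id: str, key: str) -> bytes:
        # Only serve a blob to an email recorded as referencing it, so a
        # forged reference in an incoming message cannot fetch foreign data.
        if email_id not in self._owners.get(key, set()):
            raise PermissionError("email does not reference this blob")
        return zlib.decompress(self._blobs[key])

    def release(self, email_id: str, key: str) -> None:
        owners = self._owners.get(key)
        if owners is None:
            return
        owners.discard(email_id)
        if not owners:
            # Last reference gone: the blob can be garbage collected.
            del self._owners[key]
            del self._blobs[key]

    def blob_count(self) -> int:
        return len(self._blobs)


if __name__ == "__main__":
    store = BlobStore()
    key = store.put("mail-1", b"same attachment")
    store.put("mail-2", b"same attachment")
    print(key, store.blob_count())
```

Storing the set of owning email IDs corresponds to the "reference relation in the database" variant. A plain counter would use less space but could not answer the ownership check in `get`.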
