Deduplication Suggestions for Email Storage

The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message that holds a reference to each attached file plus a record of how it was encoded. This lets the user Show Original without requiring me to store the less efficient base64 text with the message. Most of the time I will only need to store the base64 line length that was used.

This way, we can perform attachment-level deduplication.
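The model described above can be sketched with Python's standard `email` package. This is only an illustration under stated assumptions: the function name `extract_attachments`, the in-memory `blobs` dict standing in for blob storage, and the choice of SHA-256 as the key are all hypothetical and not part of the question.

```python
import hashlib
from email import message_from_bytes, policy


def extract_attachments(raw: bytes, blobs: dict) -> list:
    """Store every base64-encoded leaf part in `blobs`, keyed by the
    SHA-256 of its decoded bytes, and return one reference per part."""
    msg = message_from_bytes(raw, policy=policy.default)
    refs = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        cte = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if cte != "base64":
            continue  # leave 7bit/8bit/quoted-printable parts in place
        payload = part.get_payload(decode=True)  # decoded bytes
        digest = hashlib.sha256(payload).hexdigest()
        blobs.setdefault(digest, payload)  # identical content is stored once
        refs.append({
            "sha256": digest,
            "filename": part.get_filename(),
            "size": len(payload),
        })
    return refs
```

Two emails carrying the same file produce the same digest, so the second call adds a reference but no new blob. A real system would also record the encoding details needed to rebuild the original message byte for byte.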

But how can the deduplication go further? Here are my thoughts:

  • All attachments and emails could be compressed (byte-level deduplicated) individually of course.
  • I could compress sets of maybe 12 attachments together in a single file. Grouping files of the same type (for example, PDFs), or even files from the same sender, may compress better than compressing each file alone.
  • The MIME messages can also be compressed in sets.
  • I am not concerned about search efficiency, because searching the emails would of course use some form of full-text index, and that index would not be compressed.
  • A decompressed cache copy would be created when the email first arrives, and would only be deleted after the email has not been viewed for some time.
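The idea of compressing attachments in sets can be tried out with a small sketch like the one below. The helper names `pack_set` and `unpack_set` and the container layout (a 4-byte index length, a JSON list of file sizes, then the concatenated files) are assumptions made up for this example. It uses `zlib` for brevity; Deflate can only refer back about 32 KiB, so for larger attachments a compressor with a bigger window (such as `lzma` from the standard library) would be needed to see redundancy across files.

```python
import json
import zlib


def pack_set(files: list[bytes]) -> bytes:
    """Compress several files as one stream so that content repeated
    across files is only stored once (within the compressor's window)."""
    index = json.dumps([len(f) for f in files]).encode("ascii")
    header = len(index).to_bytes(4, "big") + index
    return zlib.compress(header + b"".join(files), 9)


def unpack_set(blob: bytes) -> list[bytes]:
    """Inverse of pack_set: return the original files in order."""
    raw = zlib.decompress(blob)
    index_len = int.from_bytes(raw[:4], "big")
    lengths = json.loads(raw[4:4 + index_len])
    files, pos = [], 4 + index_len
    for length in lengths:
        files.append(raw[pos:pos + length])
        pos += length
    return files


def individual_size(files: list[bytes]) -> int:
    """Total size when every file is compressed on its own."""
    return sum(len(zlib.compress(f, 9)) for f in files)
```

Comparing `len(pack_set(files))` with `individual_size(files)` on a sample of real mail would show whether grouping pays off. The trade-off is that reading one attachment means decompressing the whole set, which is where the decompressed cache from the last bullet comes in.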

Do you have any advice in this area? What is normal for an email storage system?


There is 1 answer

Answered by Filip Hanes
  1. decode all base64 MIME parts, not only attachments
  2. calculate a secure hash of each part's decoded content
  3. replace the part with a reference in the email body, or add a custom header listing the extracted MIME parts
  4. store each part in blob storage under its secure hash (content-addressable storage)
  5. use a reference counter for deletions and garbage collection, or a smarter double counter (https://docs.wildduck.email/#/in-depth/attachment-deduplication, https://medium.com/@andrewsumin/efficient-storage-how-we-went-down-from-50-pb-to-32-pb-99f9c61bf6b4)
  6. or store each hash-to-email-ID reference relation in the database
  7. carefully check and control base64 line folds: some emails have a shorter line in the middle, and some have additional characters (dot, whitespace) at the end
  8. store the encoding parameters (folds, tail) in the reference in the email body so the original can be reconstructed exactly
  9. compress compressible attachments, but be careful with content-addressable storage, because compression changes the content hash
  10. JPEG images can be compressed significantly and losslessly using JPEG XL or https://github.com/dropbox/lepton
  11. WAV files can be compressed using FLAC, etc.
  12. Content-Type is specified by the sender, so the same attachment can arrive with different content types
  13. quoted-printable parts are hard to decode and reconstruct exactly. There are many encoder parameters, because each encoder escapes different characters and folds lines differently.
  14. be careful with the reference format, so that a malicious sender cannot craft an email containing a reference and fetch an attachment they do not own. Alternatively, detect and escape references in received emails
  15. small MIME parts may not be worth extracting until a certain number of duplicates are present in the system
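Points 7 and 8 above can be illustrated with a rough sketch that records the fold parameters of a base64 body and only accepts them if re-encoding reproduces the original bytes exactly. The function names `describe_base64` and `rebuild_base64` and the parameter dictionary are hypothetical. The sketch assumes a single uniform line width and gives up (returns `None`) on anything irregular, in which case the part would have to be kept in its original encoded form.

```python
import base64
import binascii


def rebuild_base64(decoded: bytes, params: dict) -> bytes:
    """Re-encode `decoded` using the recorded line width and newline."""
    flat = base64.b64encode(decoded)
    newline = params["newline"].encode("ascii")
    width = params["width"]
    lines = [flat[i:i + width] for i in range(0, len(flat), width)]
    body = newline.join(lines)
    if params["trailing_newline"]:
        body += newline
    return body


def describe_base64(encoded: bytes):
    """Return (decoded, params) if the body can be regenerated byte for
    byte from the decoded content, otherwise None."""
    newline = b"\r\n" if b"\r\n" in encoded else b"\n"
    trailing = encoded.endswith(newline)
    body = encoded[:-len(newline)] if trailing else encoded
    lines = body.split(newline)
    width = len(lines[0])
    if width == 0:
        return None  # empty body or leading blank line
    try:
        # validate=True rejects stray characters such as dots or spaces
        decoded = base64.b64decode(b"".join(lines), validate=True)
    except binascii.Error:
        return None
    params = {
        "width": width,
        "newline": newline.decode("ascii"),
        "trailing_newline": trailing,
    }
    # The round trip is the real check: short middle lines, odd tails, or
    # non-canonical padding bits all make the rebuilt body differ.
    if rebuild_base64(decoded, params) != encoded:
        return None
    return decoded, params
```

Verifying the round trip at ingest time means the stored parameters are known to be sufficient, so Show Original can never silently return a different message than the one received.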