How to manage version history for Many-To-Many relationship using deltas in Hibernate?

1.3k views Asked by At

We are working on a system where a person can send a document to another person, a document can have multiple attachments, as given below.

Document {
       Set<Attachment> attachments;
}

If X sends a document (Doc1,Ver1) to Y and Y edits the document then we have to create a new version (Doc1,Ver2) of the document so that X sent box doesn't reflect the changes made by Y.

In our system there will be millions of documents where each document can have hundreds of attachments. Also a document can travel through n number of persons over a period of time.

We should be able to fetch any document version along with the list of attachments it had at that time, so I have to maintain versions of document, so the table structure that came to my mind immediately was the one below.

Document - id primary key, ver_id primary key
Attachment - id, doc_id foreign key, doc_ver_id foreign key

But the problem with this structure is that if X is sending a document to Y with 100 attachment and Y has made some minor modification then I have to create a new version along with copying all the attachments for the new version, most of which are same as the one in previous version, since we will be having millions of documents and each document will move through n number of persons this model will result in a very huge attachment table with lot of redundant data.

So we thought of an alternate structure for attachment, as below.

Document - id primary key, ver_id primary key
Attachment - id, doc_id, attached_ver_id, detached_version_id

But I couldn't create a hibernate entity for Document with this structure, so my question is, is there any other table structure which is better equipped to solve this problem with less redundancy and is it possible to create a hibernate mapping for the above table structure.

3

There are 3 answers

8
Steve Chambers On BEST ANSWER

Background

When data history needs to be kept there are usually two possible approaches:

Method #1: Cloning

When a new entry is created, its details are copied from the most recent existing entry.

Method #2: Deltas

The details for the very first entry are stored. Each subsequent entry stores the changes from the previous version.

Pros/Cons:

Method #1 is generally simpler and faster as the details for any record can be looked up directly without needing to build them up. But Method #2 uses less storage. (Probably worth noting that in my experience Method #1 has always been preferable because simplicity and speed of retrieval is usually more of an important consideration than storage).

What is being asked?

My understanding is you started off with Method #1 but now prefer Method #2.

Answer

But I couldn't create a hibernate entity for Document with this structure, so my question is, is there any other table structure which is better equipped to solve this problem with less redundancy and is it possible to create a hibernate mapping for the above table structure.

It should be perfectly possible to create entities for this database structure - each entity is of course just the Hibernate class representation of a database table. Would suggest adding a mapping table between Document Version and Attachment:

Document - id primary key, ver_id primary key, ...
Attachment - id primary key, ...
DocumentAttachmentDelta - doc_id, doc_ver_id, attachment_id, added_or_removed_flag

Here DocumentAttachmentDelta is an explicitly defined mapping table with @ManyToOne relationships linking to the primary keys identifying the document version and attachment. It has an additional boolean flag that specifies whether the attachment is being removed or added for this version of the document. So from above, for the first version of the document all its initial attachments would be added but for subsequent versions only the deltas would be stored, which could be additions or removals.

Entity details (following a request in the comments)

@Entity
class Document {
    /* ...other fields... */

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "document", orphanRemoval = true)
    List<DocumentAttachmentDelta> documentAttachmentDeltas;
}

@Entity
class Attachment {
    /* ...other fields... */

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "attachment", orphanRemoval = true)
    List<DocumentAttachmentDelta> documentAttachmentDeltas;
}

@Entity
class DocumentAttachmentDeltas {
    /* ...other fields... */

    @ManyToOne
    Document document;

    @ManyToOne
    Attachment attachment;
}
0
fhossfel On

Or you define the relationship between Document and Attachment as @ManyToMany on both sides and ensure with a callback on the Java side that all the Attachment's documents have the same id and only differ in ver_id.

1
Nathan On

If I really wanted to use deltas, I would use the following model.

@Entity
public class Document {

    @Id
    private String id;

    @Lob
    private byte [] firstVersion; 

    @OneToMany(mappedBy = "document")
    private final Set<Attachment> attachments = Sets.newHashSet();
}

Along with an attachment entity as follows

@Entity
@Table(uniqueConstraints = {
        @UniqueConstraint(columnNames = { "document_id", "version" })
})
@Check(constraints = "(delta is null and previous_version_id is null) or (delta is not null and previous_version_id is not null)")
public class Attachment {

    @Id
    private Long id;

    @Column(nullable = false, name = "version")
    private Long version;

    @Lob
    @Column(name = "delta")
    private byte [] delta;

    @JoinColumn(name = "document_id")
    @ManyToOne(optional = false)
    private Document document;

    @JoinColumn(name = "previous_version_id")
    @ManyToOne(optional = true)
    private Attachment previousVersion;

}

This way, you have the original version of the document as created by the user. Then, each attachment references the previous version of the document, reflecting the changes in the delta field. When a user sends a version of the document (which is really just an attachment) to a user, you add many-to-many between your person/user entity and the Attachment. In this way, the sent version of the document can be reconstructed, along with all of its predecessors.

Since there can only be one initial version of the document, I would consider the possibility of a partial unique constraint (partial unique index in Postgres, for example) to enforce that there is only one attachment per document that has no delta and no parent. This cannot be modeled in Hibernate, however.

The check-constraint enforces that the version with no parent also has no delta, since it is exactly the firstVersion of the document contained within the Document entity. In this model, the "version" field is not strictly necessary, but could be useful when you want to have a name for a particular version of the document and force it to be unique per document (see the unique-constraint in my annotation).

However, I would probably solve this problem as follows:

@Entity
public class Document {

    @Id
    private String id;

    @OneToMany(mappedBy = "document")
    private final Set<Attachment> attachments = Sets.newHashSet();

}

@Entity
@Table(uniqueConstraints = {
        @UniqueConstraint(columnNames = { "document_id", "version" })
})
public class Attachment {

    @Id
    private Long id;

    @Column(nullable = false, name = "version")
    private Long version;

    @Lob
    @Column(name = "content")
    private byte [] content;

    @JoinColumn(name = "document_id")
    @ManyToOne(optional = false)
    private Document document;

    @JoinColumn(name = "previous_version_id")
    @ManyToOne(optional = true)
    private Attachment previousVersion;

}

Where I would still want a partial-unique-index for the document when previous_version_id is null, ensuring that there is only one initial version per document.

With both os these solutions, you avoid any cloning of documents. With the first (using deltas) you save a bit on space, because you only ever store the complete document contents of the first version of each document. However, with the second version, you have much easier retrieval of any particular version.