Can anyone suggest a method of versioning ATTRIBUTE (rather than OBJECT) data in DB

522 views Asked by At

Taking MySQL as an example DB to perform this in (although I'm not restricted to Relational flavours at this stage) and Java style syntax for model / db interaction.

I'd like the ability to allow versioning of individual column values (and their corresponding types) as and when users edit objects. This is primarily in an attempt to drop the amount of storage required for frequent edits of complex objects.


A simple example might be

 - Food (Table)
     - id (INT)
     - name (VARCHAR(255))
     - weight (DECIMAL) 

So we could insert an object into the database that looks like...

Food banana = new Food("Banana",0.3);

giving us

+----+--------+--------+
| id |  name  | weight |
+----+--------+--------+
|  1 | Banana | 0.3    |
+----+--------+--------+

if we then want to update the weight we might use

banana.weight = 0.4;
banana.save();

+----+--------+--------+
| id |  name  | weight |
+----+--------+--------+
|  1 | Banana | 0.4    |
+----+--------+--------+

Obviously though this is going to overwrite the data.

I could add a revision column to this table, which could be incremented as items are saved, and set a composite key that combines id/version, but this would still mean storing ALL attributes of this object for every single revision

- Food (Table)
      - id (INT)
      - name (VARCHAR(255))
      - weight (DECIMAL) 
      - revision (INT)

+----+--------+--------+----------+
| id |  name  | weight | revision |
+----+--------+--------+----------+
|  1 | Banana | 0.3    |        1 |
|  1 | Banana | 0.4    |        2 |
+----+--------+--------+----------+

But in this instance we're going to be storing every single piece of data about every single item. This isn't massively efficient if users are making minor revisions to larger objects where Text fields or even BLOB data may be part of the object.


What I'd really like, would be the ability to selectively store data discretely, so the weight could possible be saved in a separate DB in its own right, that would be able to reference the table, row and column that it relates to.

This could then be smashed together with a VIEW of the table, that could sort of impose any later revisions of individual column data into the mix to create the latest version, but without the need to store ALL data for each small revision.

+----+--------+--------+
| id |  name  | weight |
+----+--------+--------+
|  1 | Banana | 0.3    |
+----+--------+--------+

+-----+------------+-------------+-----------+-----------+----------+
| ID  | TABLE_NAME | COLUMN_NAME | OBJECT_ID | BLOB_DATA | REVISION |
+-----+------------+-------------+-----------+-----------+----------+
| 456 | Food       | weight      |         1 | 0.4       |        2 |
+-----+------------+-------------+-----------+-----------+----------+

Not sure how successful storing any data as blob to then CAST back to original DTYPE might be, but thought since I was inventing functionality here, why not go nuts.

This method of storage would also be fairly dangerous, since table and column names are entirely subject to change, but hopefully this at least outlines the sort of behaviour I'm thinking of.

1

There are 1 answers

1
philipxy On BEST ANSWER

A table in 6NF has one CK (candidate key) (in SQL a PK) and at most one other column. Essentially 6NF allows each pre-6NF table's column's update time/version and value recorded in an anomaly-free way. You decompose a table by dropping a non-prime column while adding a table with it plus an old CK's columns. For temporal/versioning applications you further add a time/version column and the new CK is the old one plus it.

Adding a column of time/whatever interval (in SQL start time and end time columns) instead of time to a CK allows a kind of data compression by recording longest uninterupted stretches of time or other dimension through which a column had the same value. One queries by an original CK plus the time whose value you want. You dont need this for your purposes but the initial process of normalizing to 6NF and the addition of a time/whatever column should be explained in temporal tutorials.

Read about temporal databases (which deal both with "valid" data that is times and time intervals but also "transaction" times/versions of database updates) and 6NF and its role in them. (Snodgrass/TSQL2 is bad, Date/Darwen/Lorentzos is good and SQL is problematic.)

Your final suggested table is an example of EAV. This is usually an anti-pattern. It encodes a database in to one or more tables that are effectively metadata. But since the DBMS doesn't know that you lose much of its functionality. EAV is not called for if DDL is sufficient to manage tables with columns that you need. Just declare appropriate tables in each database. Which is really one database, since you expect transactions affecting both. From that link:

You are using a DBMS anti-pattern EAV. You are (trying to) build part of a DBMS into your program + database. The DBMS already exists to manage data and metadata. Use it.

Do not have a class/table of metatdata. Just have attributes of movies be fields/columns of Movies.

The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement via calls that update metadata tables sometimes instead of just updating regular tables: DDL instead of DML.