Designing an EAV database correctly for historical data

2.1k views Asked by At

Intro

I have been reading about EAV database and most of the short comings seem to be related to really, really, bad EAV designs or difficulty generating reports from the data.

Usually when you see people complaining about EAV they are using less than three tables to try to replicate the functionally of separate tables + columns in a RDBMS. Sometimes that means storing everything from decimals to strings in a single TEXT value column. EAV also messes with the safe-guards over data integrity which can be very bad if you are not careful.

However, EAV does provide an easy way to track historical data and allows us to move parts of the system back and forth between SQL and key-value store systems.

What if we separate different entity attributes based on their type. This would allow us to still handle belongsTo, Has, HasMany, and HasManyThrough relations in addition to properly indexed values tied to specific attributes and entities.

Considering the following two base entities

products (price -> decimal, title -> string, desc -> text, etc...)
    attributes
        options
            [...]
        int
        datetime
        string
        text
        decimal
        relation
            [id,foreign_key]

users (gender -> options, age -> int, username -> string, etc...)
    attributes
        options
            [...]
        int
        datetime
        string
        text
        decimal
        relation
            [id,foreign_key]

RDBMS Schema Design

As we all know, users profiles and products are some of the most diverse items in the world. Each company handles them differently and has different "columns" or "attributes" for their needs.

The following is a view of how to handle multiple (nested and/or relational) entities.

The idea is that for each entity has this master attribute table that then specifies how to find and interpret those values. This allows us to handle special cases like foreign keys to other entities and things like "options" or decimal numbers.

entity_type { id, type, // i.e. "blog", "user", "product", etc.. created_at }

entity {
    id,
    entity_type_id, 
    created_at
}

    attr {
        id,
        entity_id,
        type,
        name,
        created_at
    }

        option {
            id,
            attr_id,
            entity_id,
            multiple, // multiple values allowed?
            name,
            created_at
        }

        attr_option {
            id
            attr_id,
            entity_id,
            option_id
            option,
            created_at
        }

        attr_int {
            attr_id,
            entity_id,
            int,
            created_at
        }

        attr_relation {
            attr_id,
            entity_id,
            entity_fk_id,
            created_at
        }

        attr_datetime {
            attr_id,
            entity_id,
            datetime,
            created_at
        }

        attr_string {
            attr_id,
            entity_id,
            var_char,
            created_at
        }

        attr_text {
            attr_id,
            entity_id,
            text,
            created_at
        }

        attr_decimal {
            attr_id,
            entity_id,
            decimal,
            created_at
        }

A table like this would allow us to never have to UPDATE ... since we could just INSERT INTO ... for each new attribute that changes value and add the created_at to know what the most recent value is. This is perfect for keeping records of historical data (Exceptions could still be made of course).

Sample queries

First, what "type" of entity is it? (user, post, comment, etc..)

SELECT * FROM entity_type et LEFT JOIN entity e ON e.entity_type_id = et.id WHERE e.id = ?

Next, what are the attributes of this entity? (TABLE attr)

SELECT * FROM attr WHERE entity_id = ?

Next, what values exist in the attributes for this entity? (attr_### tables)

SELECT * FROM attr_option, attr_int, attr_relation, attr_text, ... WHERE entity_id = ?
vs
SELECT * FROM attr_option WHERE entity_id = ? if( ! multiple) ORDER BY created_at DESC LIMIT 1
SELECT * FROM attr_int WHERE entity_id = ? ORDER BY created_at DESC LIMIT 1
SELECT * FROM attr_relation WHERE entity_id = ? ORDER BY created_at DESC LIMIT 1
SELECT * FROM attr_text WHERE entity_id = ? ORDER BY created_at DESC LIMIT 1
...

What relations exist for this entity?

Assuming we have a "post" entity with an ID of 34 and we want the "comments" for it (entity_type = 2), this could allow us to fetch comment entity ids on a product entity:

SELECT * FROM entity AS e
LEFT JOIN attr_relation AS ar ON ar.entity_id = e.id
WHERE ar.entity_id = 34 AND e.entity_type = 2;

Apart from multiple queries, (which are needed with key-value stores anyway), what problems would exist with this approach?

3

There are 3 answers

1
philipxy On

An EAV 'database' [sic] is literally mathematically straightforwardly an undocumented description in triples of a database and its metadata, with no functionality to tablulate relationships, or query relationships, or query metadata, or type check, or maintain integrity, or optimize, or transact atomically, or control concurrency.

Software engineering principles dictate that sound EAV database [sic] use consist entirely of defining appropriate abstractions (types, operators, processes, interpreters, modules) reconstructing functionality of a DBMS.

The mechanical nature of the mapping from one's EAV triples and their meanings to a (fragmented) database description makes this easy to show.

To paraphrase Greenspun, any sufficiently complex EAV project contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of a DBMS.

I repeat: EAV is an undocumented description in triples of a database and its metadata, with no DBMS. Use EAV only for parts of a database where you have demonstrated that a DDL solution cannot meet performance requirements and that an EAV solution can and is worth it.

0
Erwin Smout On

I have been reading about EAV database and most of the short comings seem to be related to really, really, bad EAV designs or difficulty generating reports from the data.

The difficulties generating reports derive inherently and inevitably from the kind of facts that an EAV DB represents: "The value of attribute BIRTHDATE for person XYZ is ..." "The value of attribute DECEASEDATE for person XYZ is ..." etc. etc.

That is NOT the typical form in which end users think of the data structures for carrying information about person XYZ (or any other), ergo somewhere in between the end user and the DB, additional transformation (very akin to pivoting, though not exactly 100%) is necessary. Each additional transformation is a potential source of bugs and of performance loss.

Usually when you see people complaining about EAV they are using less than three tables to try to replicate the functionally of separate tables + columns in a RDBMS. Sometimes that means storing everything from decimals to strings in a single TEXT value column.

That is ONLY ONE of the downsides of EAV. Attribute-level type constraints become harder or impossible to define. But there are others besides this.

EAV also messes with the safe-guards over data integrity which can be very bad if you are not careful.

This one is entirely correlated to the difficulty of report generation, which is the exact same thing as the difficulty of expressing meaningful queries, which is the exact same thing as the difficulty of expressing the scenarios that constitute a violation of some given rule.

However, EAV does provide an easy way to track historical data and allows us to move parts of the system back and forth between SQL and key-value store systems.

BS & baloney. EAV applied rigorously will move the time information just as far away from the things it applies to as any other "regular" attribute. If you don't do this, then you're not applying EAV any longer (strictly). See Bill Karwin's answer: EAV != 6NF !!!!!!!!! 6NF still has all the "structure" that any other "regular" DB also has, EAV is all about (see philipxy's answer and Bill's 'inner platform' remark) effectively removing that structure from the DB.

3
Bill Karwin On

Here are some problems with this design.

  • How would you query for the current value of all integer attributes of a given entity?

  • How would you model an attribute that should be NOT NULL? That is, make sure a given attribute is mandatory for its entity, and the entity cannot be created without a value for that attribute.

  • How would you model a UNIQUE column? Assume you can change an attribute's value and then change it back to the original value.

  • How do you support foreign keys that reference a entity with something other than an integer primary key?

  • How do you restrict a given attribute to the set of values in a lookup table?

The only way to solve most of these is by using application code. That's the problem with EAV: you end up reinventing many constraints we take for granted with SQL. It's an example of the Inner Platform Effect antipattern:

The inner-platform effect is the tendency of software architects to create a system so customizable as to become a replica, and often a poor replica, of the software development platform they are using.

Sixth normal form is not EAV. In Sixth normal form, you need a separate table per attribute, not per data type. You use conventional columns with the appropriate name and data type. Storing this attribute in a distinct table is what gives you the ability to store historical revisions.

This means you still can't model NOT NULL in 6NF, but at least you can model UNIQUE and FOREIGN KEY in a pretty conventional way.