Intro
I have been reading about EAV database and most of the short comings seem to be related to really, really, bad EAV designs or difficulty generating reports from the data.
Usually when you see people complaining about EAV they are using less than three tables to try to replicate the functionally of separate tables + columns in a RDBMS. Sometimes that means storing everything from decimals to strings in a single TEXT
value column. EAV also messes with the safe-guards over data integrity which can be very bad if you are not careful.
However, EAV does provide an easy way to track historical data and allows us to move parts of the system back and forth between SQL and key-value store systems.
What if we separate different entity attributes based on their type. This would allow us to still handle belongsTo, Has, HasMany, and HasManyThrough relations in addition to properly indexed values tied to specific attributes and entities.
Considering the following two base entities
products (price -> decimal, title -> string, desc -> text, etc...)
attributes
options
[...]
int
datetime
string
text
decimal
relation
[id,foreign_key]
users (gender -> options, age -> int, username -> string, etc...)
attributes
options
[...]
int
datetime
string
text
decimal
relation
[id,foreign_key]
RDBMS Schema Design
As we all know, users profiles and products are some of the most diverse items in the world. Each company handles them differently and has different "columns" or "attributes" for their needs.
The following is a view of how to handle multiple (nested and/or relational) entities.
The idea is that for each entity has this master attribute table that then specifies how to find and interpret those values. This allows us to handle special cases like foreign keys to other entities and things like "options" or decimal numbers.
entity_type { id, type, // i.e. "blog", "user", "product", etc.. created_at }
entity {
id,
entity_type_id,
created_at
}
attr {
id,
entity_id,
type,
name,
created_at
}
option {
id,
attr_id,
entity_id,
multiple, // multiple values allowed?
name,
created_at
}
attr_option {
id
attr_id,
entity_id,
option_id
option,
created_at
}
attr_int {
attr_id,
entity_id,
int,
created_at
}
attr_relation {
attr_id,
entity_id,
entity_fk_id,
created_at
}
attr_datetime {
attr_id,
entity_id,
datetime,
created_at
}
attr_string {
attr_id,
entity_id,
var_char,
created_at
}
attr_text {
attr_id,
entity_id,
text,
created_at
}
attr_decimal {
attr_id,
entity_id,
decimal,
created_at
}
A table like this would allow us to never have to UPDATE ...
since we could just INSERT INTO ...
for each new attribute that changes value and add the created_at
to know what the most recent value is. This is perfect for keeping records of historical data (Exceptions could still be made of course).
Sample queries
First, what "type" of entity is it? (user, post, comment, etc..)
SELECT * FROM entity_type et LEFT JOIN entity e ON e.entity_type_id = et.id WHERE e.id = ?
Next, what are the attributes of this entity? (TABLE attr)
SELECT * FROM attr WHERE entity_id = ?
Next, what values exist in the attributes for this entity? (attr_### tables)
SELECT * FROM attr_option, attr_int, attr_relation, attr_text, ... WHERE entity_id = ?
vs
SELECT * FROM attr_option WHERE entity_id = ? if( ! multiple) ORDER BY created_at DESC LIMIT 1
SELECT * FROM attr_int WHERE entity_id = ? ORDER BY created_at DESC LIMIT 1
SELECT * FROM attr_relation WHERE entity_id = ? ORDER BY created_at DESC LIMIT 1
SELECT * FROM attr_text WHERE entity_id = ? ORDER BY created_at DESC LIMIT 1
...
What relations exist for this entity?
Assuming we have a "post" entity with an ID of 34 and we want the "comments" for it (entity_type = 2), this could allow us to fetch comment entity ids on a product entity:
SELECT * FROM entity AS e
LEFT JOIN attr_relation AS ar ON ar.entity_id = e.id
WHERE ar.entity_id = 34 AND e.entity_type = 2;
Apart from multiple queries, (which are needed with key-value stores anyway), what problems would exist with this approach?
An EAV 'database' [sic] is literally mathematically straightforwardly an undocumented description in triples of a database and its metadata, with no functionality to tablulate relationships, or query relationships, or query metadata, or type check, or maintain integrity, or optimize, or transact atomically, or control concurrency.
Software engineering principles dictate that sound EAV database [sic] use consist entirely of defining appropriate abstractions (types, operators, processes, interpreters, modules) reconstructing functionality of a DBMS.
The mechanical nature of the mapping from one's EAV triples and their meanings to a (fragmented) database description makes this easy to show.
To paraphrase Greenspun, any sufficiently complex EAV project contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of a DBMS.
I repeat: EAV is an undocumented description in triples of a database and its metadata, with no DBMS. Use EAV only for parts of a database where you have demonstrated that a DDL solution cannot meet performance requirements and that an EAV solution can and is worth it.