Why does SQL standard allow duplicate rows?

One of the core rules for the relational model is the required uniqueness for tuples (rows):

Every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

In the SQL world, that would mean that no two rows in a table could ever have all column values equal. If there were no meaningful way to guarantee uniqueness, a surrogate key could be introduced to the table.

When the first SQL standard was released, it defined no such restriction, and it has stayed that way ever since. This seems like a root of all kinds of evil.

Is there any meaningful reason why it was decided to be that way? In the practical world, where could the absence of such a restriction prove useful? Does it outweigh the cons?

There are 7 answers

Answer by Ben (best answer)

You're assuming that databases exist solely to store relational data; that's certainly not all they're used for, because practical considerations will always win.

An obvious example where there's no need for a primary key would be a "state" log of some description (weather/database/whatever). If you're never going to query a single row from this table, you may not want a primary key, to avoid paying for index maintenance on every insert. If you have a use case for picking a single value out of this table then, sure, this would be a bad solution; but some people just don't need that. You can always add a surrogate key afterwards if it becomes absolutely necessary.
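
For instance, a minimal sketch of that trade-off (hypothetical names, PostgreSQL-flavoured syntax):

-- Append-only "state" log: no key at all, so inserts never pay for
-- index maintenance.
CREATE TABLE weather_log (
    recorded_at timestamp,
    station     text,
    temperature numeric
);

-- If single-row lookups become necessary later, bolt on a surrogate key:
ALTER TABLE weather_log ADD COLUMN id bigserial PRIMARY KEY;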

Another example would be a write-intensive application that needs to tell another process to do something. That secondary process runs every N minutes/hours/whatever. De-duplicating N million records in one pass is quicker than checking for uniqueness on every insert into the table (trust me).
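
One way that one-off pass could look (a sketch; the table names are invented, and the column list defines what counts as a "duplicate"):

-- Set-based de-duplication in a single pass, instead of a uniqueness
-- check on every insert into the unconstrained staging table.
INSERT INTO events_processed (event_time, payload)
SELECT DISTINCT event_time, payload
FROM events_raw;

TRUNCATE TABLE events_raw;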

What are sold as relational databases are not being used solely as relational databases. They're being used as logs, key-value stores, graph databases etc. They may not have all the functionality of the competition but some do and it's often simpler to have a single table that doesn't fit your relational model than to create a whole other database and suffer the data-transfer performance penalties.

tl;dr People aren't mathematically perfect and so won't always use the mathematically perfect method of doing something. Committees are made up of people and can realise this, sometimes.

Answer by Rahul

"In a SQL world, that would mean that there could never exist two rows in a table for which all the column values were equal", and that's true. Unless all the attributes of one tuple match another's, it's not a duplicate, even if the two rows differ only in the primary key column.

That's why we should define additional unique key column(s) alongside the primary key, to identify each record as unique.
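
For example, a sketch with invented names: a surrogate primary key for addressing, plus a unique constraint over the natural attributes, so logical duplicates are rejected:

CREATE TABLE employees (
    id       integer      PRIMARY KEY,              -- surrogate key: addresses the row
    badge_no varchar(20)  NOT NULL,
    name     varchar(100) NOT NULL,
    CONSTRAINT uq_employees_badge UNIQUE (badge_no) -- natural key: blocks duplicates
);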

Answer by Guffa

Although that is normally how tables work, it's not practical to have it as a rule.

To follow the rule, a table would always have to have a primary key. That means that you couldn't simply remove the primary key on a table and then add a different one; you would need to make both changes at once, so that the table is never without a primary key.
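
Some dialects do let you make both changes in one statement; a PostgreSQL-flavoured sketch (table and column names invented, orders_pkey being PostgreSQL's default constraint name):

-- Drop the old primary key and add the new one in a single ALTER TABLE,
-- so the table is never observed without a primary key.
ALTER TABLE orders
    DROP CONSTRAINT orders_pkey,
    ADD PRIMARY KEY (order_no, line_no);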

Answer by nvogel

The short answer is that SQL is not relational and SQL DBMSs are not relational DBMSs.

Duplicate rows are a fundamental part of the SQL model of data because the SQL language doesn't really try to implement the relational algebra. SQL uses a bag (multiset)-based algebra instead. The results of queries and other operations in relational algebra are relations that always have distinct tuples, but SQL DBMSs don't have the luxury of dealing only with relations. Given this fundamental "feature" of the SQL language, SQL database engines need to have mechanisms for processing and storing duplicate rows.
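
A small illustration of those bag semantics (runnable as-is in, e.g., PostgreSQL):

-- Query results are bags, not sets: duplicates survive unless removed.
SELECT x FROM (VALUES (1), (1), (2)) AS t(x);           -- 1, 1, 2
SELECT DISTINCT x FROM (VALUES (1), (1), (2)) AS t(x);  -- 1, 2

SELECT 1 UNION     SELECT 1;   -- set semantics: one row
SELECT 1 UNION ALL SELECT 1;   -- bag semantics: two rows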

Why was SQL designed that way? One reason seems to be that the relational model was just too big a leap of faith to make at that time. The relational model was an idea well ahead of its time. SQL on the other hand, was and remains very much rooted in the systems of three decades ago.

Answer by Erwin Smout

The very first versions of the language did not have any form of constraints, including keys. So uniqueness could simply not be enforced. When support for constraints (keys in particular) was later added to the language, operational systems had already been written, and nobody wanted to break backward compatibility. So it (allowing duplicates) has been there ever since.

Many neat little topics of historical background, just like this one, can be found in Hugh Darwen's book "SQL: A Comparative Survey" (freely available from bookboon).

(EDIT: presumably the reason there was no support for constraints in the very first versions of the language is that, at the time, Codd's main vision was that the query language would effectively be a query (i.e. read-only) language, and that the "relational" aspect of the DBMS would be limited to a "relational wrapper layer" over existing databases that were not relational in structure. From that perspective there is no question of updating in the language itself, hence no need to define constraints, because those were defined and enforced in the existing, non-relational database. But that approach was abandoned pretty early on.)

Answer by Roland

I don't know the answer to this question, but I can share my opinion: I have always thought that a duplicate tuple (= row) in a relation (= table) has no meaning and adds no information or value. In my 30+ years with Oracle and PostgreSQL I have spent countless hours removing duplicates. So the argument that removing duplicates is "hard" for a system is moot: yes, it is hard, so delegate the task to the machine. That's why we use a machine. Let's not do this ourselves.

One answer argued that, given a table with names and ages, a query on just the ages might show duplicate ages, which may be handy for running statistics. My point is that the duplicate ages are just annoying: you will need to remove the duplicates to get a result that is meaningful, that you can publish. If you need statistics, you should not just query for the ages, but query for the relation between each age and the number of people of that age, which will not have duplicates:

select age, count(*)
from persons
group by age

result e.g.:

+-----+-------+
| age | count |
+-----+-------+
| 24  |   2   |
| 25  |   1   |
+-----+-------+

Think relationally: every tuple has columns that relate to each other.

If a query result has duplicate rows, the duplicate is noise, with no meaning, and needs to be removed.

I have not seen a use for duplicates.

Answer by user3673

One reason that no one talks about is that Codd was simply wrong to elide duplicate rows. He ignored the final chapters of Russell and Whitehead's Principia Mathematica, one of which was devoted to "Relation Arithmetic". At HP's "Internet Chapter 2" project, I was able to peel off a little money to hire one of Paul Allen's think-tank participants, who had been working on quantum programming languages, to bring that work into computer network programming. His name was Tom Etter, and he went back and reviewed Russell's work. He discovered a flaw in it, a limitation that Russell himself admitted, and figured out how to remove that limitation. Perhaps Codd did look at Relation Arithmetic but was put off by this limitation. I don't know. But what I do know is that the utility of duplicate rows is obvious:

What many people end up doing in SQL is keeping an extra column for duplicate row counts. Aside from Codd's "twelve rules" declaring that you aren't allowed access to the counts within his "relational algebra", there is the fact that a duplicate-row-count column doesn't belong at the same level of abstraction as the row data themselves. If you want to treat it as "data", it is what is called "metadata". There are all kinds of ways to get confused about levels of abstraction. Here's a way to get _un_confused about this particular distinction between levels:

Imagine an implementation of relational database "algebra" that permits duplicate rows and, trying not to get confused, consider this simple use case:

Age, Name
24, John
25, Mary
24, Elitsa

Now we perform a projection from the higher-dimensional relation down to a lower-dimensional one, Age:

Age
24
25
24

The latent variables of the higher dimensions are absent from this view, but their cases are still represented in the statistics of this table. In terms of data representation, including an additional column for the number of times a row occurs works just fine, so long as we avoid confusion about the levels of abstraction. Why might one want to just leave the duplicate rows in the projected table rather than counting them up? That is very situation-dependent, but a clear case is Monte Carlo simulation sampling and/or imputation of missing values:

You just use a uniformly distributed random number generator to pick out which row you want to sample.
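
In SQL terms, something like this sketch (random() is PostgreSQL-specific; persons stands for the Age/Name table above, with duplicates kept):

-- Because duplicates are retained, sampling rows uniformly reproduces the
-- underlying distribution: age 24 is drawn twice as often as age 25.
SELECT age
FROM persons
ORDER BY random()
LIMIT 1;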

These statistics are essential to distinguish Etter's (and Russell's) "relation numbers" from Codd's "relations". Moreover, the SQL specification violates Codd's "relation algebra" in precisely this manner.

A practical implication of adopting relations as the most general formalism in computer programming languages:

Nowadays people are spending vast amounts of time dealing with parallelizing their computations. While it is true that functional programming permits a level of abstraction in which some parallelism is inherent to the structure of the program, it is limited to what might be thought of as "and parallelism". For instance:

z^2 = x^2+y^2

If x and y are specified, this is a functional expression that permits parallel evaluation of the subexpressions x^2 and y^2 because of their independence. But what if, instead of treating "=" as an assignment, we treat it as a relation? Now, without duplicating the code, we have a different "computer program" depending on which of the variables are specified. For instance, if z is specified but not x and y, the constraint restricts the algebraic solutions to a geometric shape. This is, in fact, how parametric CAD systems operate.
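
A toy way to see this in SQL terms (invented names; the table holds a finite sample of the relation and can be queried in any direction):

-- The relation z^2 = x^2 + y^2 materialised as a table.
CREATE TABLE pythagorean (x int, y int, z int);
INSERT INTO pythagorean VALUES (3, 4, 5), (4, 3, 5), (6, 8, 10), (8, 6, 10);

SELECT z    FROM pythagorean WHERE x = 3 AND y = 4;  -- "forward": compute z
SELECT x, y FROM pythagorean WHERE z = 10;           -- "backward": the solution set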

To illustrate further we can return to the simple constraint:

x^2 = 1

Here, again, '=' is a constraint, but watch what now happens with the resulting relation table:

x
1
-1

This is "or parallelism". It is of a fundamentally different character than "and parallelism". It is, in fact, the sort of parallelism that differentiates between abstraction levels of application level programming and operating system level programming -- but that is far from the only way we elide "or parallelism" and cause vast suffering.