Using MySQL unique index to prevent duplicates, instead of duplicate searching?


I have a large table (5 million rows), with a unique identifier column called 'unique_id'

I'm running the INSERT query through Node.js (node-mysql bindings) and there's a chance that duplicates could be attempted to be inserted.

The two solutions are:

1) Make 'unique_id' an index, and check the entire database for a duplicate record, prior to INSERT:

'SELECT unique_id FROM my_table WHERE unique_id = ? LIMIT 1'

(with my_table standing in for the table name, and a placeholder instead of string concatenation, which would be open to SQL injection)

2) Make 'unique_id' a unique index within MySQL, and perform the INSERT without checking for duplicates. Clearly, any duplicates would cause error and not be inserted into the table.

My hunch is that solution 2) is better, as it avoids a worst-case search of (5 million - 1) rows for a duplicate.

Are there any downsides to using solution 2)?


There are 2 answers

Michal M. (best answer)

There are a number of advantages to defining a unique or primary index on the unique_id column:

  • Semantic correctness - currently the name does not reflect reality, as you can have duplicates in a column called 'unique_id',
  • Auto-generation of unique ids - you can delegate this job to the database and avoid id conflicts (this would not be a problem if you were using UUIDs instead of integers),
  • Speed gain - to be a reliable solution 1 would require a blocking transaction (no new rows should be inserted between checking for duplicate and inserting a row). Delegating this to MySQL will be much more efficient,
  • Following a common pattern - this is exactly what unique and primary indexes were designed to do. Your solution will be easy to understand for other developers,
  • Less code.

With the 2nd solution you might need to handle the attempt to insert a duplicate (unless your unique ids are generated by MySQL).
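For instance, the node-mysql driver reports a violated unique index as error code ER_DUP_ENTRY. A small helper (the function name is my own, and the query call is only sketched in a comment) can separate that expected case from real failures:

```javascript
// Sketch: classify a node-mysql insert error. This assumes the driver's
// convention of exposing the MySQL error symbol on err.code.
function isDuplicateKeyError(err) {
  return Boolean(err) && err.code === 'ER_DUP_ENTRY';
}

// Hypothetical usage with node-mysql (connection setup omitted):
// connection.query('INSERT INTO items SET ?', row, function (err) {
//   if (isDuplicateKeyError(err)) {
//     // duplicate unique_id -- safe to ignore or report to the caller
//   } else if (err) {
//     throw err; // a genuine failure
//   }
// });

// Simulated error objects, shaped like those the driver returns:
console.log(isDuplicateKeyError({ code: 'ER_DUP_ENTRY', errno: 1062 })); // true
console.log(isDuplicateKeyError({ code: 'ER_PARSE_ERROR' }));            // false
```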

Autoincremented primary index: https://dev.mysql.com/doc/refman/5.7/en/example-auto-increment.html
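A minimal table definition along those lines might look like this (the table name, column types, and index name are illustrative, not taken from the question):

```sql
CREATE TABLE items (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- generated by MySQL
  unique_id VARCHAR(64)  NOT NULL,                 -- your identifier
  payload   TEXT,
  PRIMARY KEY (id),
  UNIQUE KEY uq_unique_id (unique_id)              -- rejects duplicates at INSERT time
) ENGINE=InnoDB;
```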

LSerni

Surprisingly, it makes little difference performance-wise. The search will use (and require) the same index.

What little performance difference there is, however, is to the advantage of your (2) solution.

Actually in MySQL you can get rid of the error altogether using the IGNORE keyword:

INSERT IGNORE INTO ... VALUES (1, 2, 3), (4, 5, 6), (7, 8, 9)...;

will always succeed, skipping any duplicate rows. This also lets you insert several rows in a single statement, as above.

You might also be interested in the ON DUPLICATE KEY UPDATE family of tricks :-).
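A sketch of that family, assuming an illustrative items table with a unique index on unique_id: instead of skipping the duplicate, you overwrite part of the existing row.

```sql
-- Upsert: insert a new row, or update the existing one
-- when unique_id collides with a row already in the table
INSERT INTO items (unique_id, payload)
VALUES ('abc-123', 'new data')
ON DUPLICATE KEY UPDATE payload = VALUES(payload);
```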

The real difference, as M.M. already stated, is in integrity. With a UNIQUE index constraint you can be sure of your data; otherwise, you need to LOCK the table between the moment you check and the moment you insert the new tuple, to avoid the risk of someone else inserting the same value in between.


Your (1) solution may have a place if the "duplicateness" of the data requires significant business-logic work that cannot easily be translated into a MySQL constraint. In that case you would:

  • lock the table,
  • search for candidate duplicates (say you get 20 of them),
  • fetch those rows and verify whether they truly conflict,
  • insert the new tuple if none do,
  • release the lock.
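The steps above can be sketched in SQL (the items table, the column names, and the values are illustrative):

```sql
LOCK TABLES items WRITE;          -- nobody else can insert meanwhile

-- find candidate duplicates for the row we want to add
SELECT id, unique_id, payload
FROM items
WHERE unique_id = 'abc-123';

-- the application inspects the candidates here and decides
-- whether the new row truly conflicts with any of them

-- insert only if no real conflict was found
INSERT INTO items (unique_id, payload)
VALUES ('abc-123', 'new data');

UNLOCK TABLES;                    -- release the lock
```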

(It might be argued on good grounds that the need to do such a complicated merry-go-round stems from some error in the database design. Ideally you should be able to do everything in MySQL. But business reality has a way of being far from ideal sometimes).