NOSQL denormalization datamodel

12.6k views Asked by At

Many times I read that data in NOSQL databases is stored denormalized. For instance consider a chess game record. It may not only contain the player id's that participate in the chess game, but also the first and lastname of that player. I suppose this is done because joins are not possible in NOSQL, so if you just duplicate data you can still retrieve all the data you want in one call without manual application level processing of the data.

What I don't understand is that now when you want to update a chess-player's name, you will have to write a query that updates both the chess-game records in which that player participates as well as the player record of that player. This seems like a huge performance overhead as the database will have to search all games where that player participates in and then update each of those records.

Is it true that data is often stored denormalized like in my example?

3

There are 3 answers

0
George On BEST ANSWER

You are correct, the data is often stored de-normalized in NoSQL databases.

The problem with the updates is partially where the term "eventual consistency" comes from.

In your example, when you update the player's name (not a common event, but it can happen), you would issue a background job to update the name across all other records. Yes, while the update is happening you may retrieve an older value, but eventually the data will be consistent. Since we're not writing ATM software here, the performance/consistency tradeoff is acceptable.

You can find more info here: http://www.allbuttonspressed.com/blog/django/2010/09/JOINs-via-denormalization-for-NoSQL-coders-Part-2-Materialized-views

0
Thijs Koerselman On

I have been working for the past 7 years with NoSQL (Firestore) for 2 fairly big projects where I was able to write code from scratch (both around 50k LoC and one has about 15k daily active users). I didn't use denormalization at all. The concept never appealed to me, and document reads are fairly cheap in Firestore.

To come back to your example; loading the other data for the chess game seems way more important than instantly being able to show the name. I would load the name based on the user id in the background and put a simple client-side memoize / cache around it to prevent fetching the same user document over and over.

What I did use quite a bit to solve performance issues is generate derived data. I would set a listener on a database document "onWrite" and then store some computed data in another derived document. These documents would automatically update when the source changes, so it doesn't complicate things really. In the case of a chess game, a distilled document could be the leaderboard that is constantly shown to all users of the app.

Another optimization I had to do was to distill a long list of titles + metadata for recently opened "projects". Firestore on the web client side doesn't give the ability to select fields from a document in a query. It only fetches full documents and that was too much data for the list, so we solved this by making an API endpoint to fetch the distilled data through there.

I'm not saying you should follow my advice, but we seem to be doing well in terms of code complexity and database costs. So when I read that NoSQL requires data denormalization I become skeptical :)

That's my 2 cents.

0
mzaink On

One way to look at it is that the number of times the user changes his/her name is extremely rare. But the number of times that board data is read and changed is immense.

So it only makes sense to optimize for a case that will happen so much more times than a case that's only happening ever so rarely.

Another point to note is that by not keeping that name data duplicated under board data, you are actually increasing the performance overhead of the read. Every time you fetch the board data, you'd have to go one more step ahead and fetch all the user data too (even if all you really wanted was just first and last name).

Again the reason to put that first name and last name on board data is probably that on the screen where the board data will be shown, you'll often be showing the user's name too.

For these reasons, you are spared to have duplicate data on NoSQL DBs. (Although this can be done in SQL DBs too but mind ya, you'll be frowned upon). Duplication in NoSQL world is fairly common and is promoted too.