DynamoDB single table design

499 views Asked by At

I'm building an app using DynamoDB as the database. My app has a concept of users, organizations, projects and scripts.

  • An Organization has many projects
  • A project or organization has many scripts
  • A user can be part of a single organization so an organization can have many users.
  • A script can either belong to a project or an organization. Each script has revisions which needs to be store so that users can rollback to previous version.

Use cases the app needs to query:

  • Get all users belonging to a organization
  • Get all projects belonging to a organization
  • Get all scripts belonging to a project
  • Get script by id
  • Get all versions of the script by script id
  • Get latest version of the script by script id
  • Get organization by org id
  • Get project by project id
  • Get script by script id

I was thinking of creating a single table Entities:

pk sk type (gsi1-pk) parent (gsi1-sk)
root#org-1 info org root
org-1#proj-1 info project org-1
proj-1#script-1 info script proj-1
proj-1#script-1 rev_latest script_revision script-1
proj-1#script-1 rev_1 script_revision script-1
proj-1#script-1 rev_2 script_revision script-1
org-1#user-1 info user org-1
org-1#user-2 info user org-1

PK and SK would allow me to get the data of specific items providing the ID. For the queries where I have the id of a parent object and need to fetch the children, I would create a GSI and query against that.

For PK, I could drop the first token before # which is supposed to represent the parent object which is redundant since it's defined in the parent column but I left it there to prevent accidental cross-object updates.

Am I better off with multiple DynamoDB tables to store the info?

I would prefer not using scan queries for economic reasons. Ideally I can get all the data via query and scan operation.

2

There are 2 answers

5
fedonev On BEST ANSWER

Here's one single-table option. The design satisfies your access patterns using query operations. No scan operations are required.

pk sk gsi1pk gsi1sk
org#org-1 xInfo
proj#proj-1 xInfo org#org-1 proj#proj-1#xInfo
script#script-1 xInfo proj#proj-1 script#script-1#xInfo
script#script-1 rev#2023-10-05T22:00:00 proj#proj-1 script#script-1#rev#2023-10-05T22:00:00
script#script-1 rev#2023-03-13T10:00:00 proj#proj-1 script#script-1#rev#2023-03-13T10:00:00
script#script-1 rev#2022-12-20T08:00:00 proj#proj-1 script#script-1#rev#2022-12-20T08:00:00
user#user-1 xInfo org#org-1 user#user-1#xInfo
user#user-2 xInfo org#org-1 user#user-1#xInfo

The table primary key models the records. The partition key pk has the format <entity-type>#<entity-id>. I find the entity-type prefix helpful to visually identify a record's entity type, but it is optional here. The sort key sk defines the record type. Scripts have two record types, info and revisions. Giving script revisions a timestamp "id" makes it easy to retrieve all revisions or only the latest revision1. I renamed info to xInfo to move the "info" record after all revisions in lexical ordering2. Why do this? So we can query a script by pk in descending order (ScanIndexForward: false) with a limit of 2 to get a script's info record and latest revision in a single operation. No limit returns all revisions and the info record.

Use the table primary key to answer all the "by id" queries. We need an index to handle the "belongs to" queries.

The secondary index partition key (gsi1pk) models the "belongs to" relation. Organizations appear not to belong to anything, so the index is blank. The index sort key gsi1sk takes the form <entity-type>#<entity-id>#<record-type>. The record-type suffix isn't needed except for scripts, but seems like a good idea to apply everywhere for consistency and future-proofing.

"Get all users belonging to a organization" is answered with a query of gsi1pk = org#org-1 and gsi1sk > "user#". The same pattern satisfies the other "belongs to" queries.

A note on your closing question "Am I better off with multiple dynamodb tables...?": Including multiple tables in your design doesn't give you more modelling options. You are still working with partition and sort keys. Multiple tables may be called for in certain cases, but it doesn't really help with modelling as such.


  1. The timestamp component also gives you the option to query for a subset of revisions by time period (e.g. current year's revisions).

  2. You could use ~info or z instead of xInfo, as long as the prefix character comes after rev in UTF-8 sort order. We do this to engineer a useful script sort key (descending) order: info record, latest revision, latest-1 revision, etc.

2
Quassnoi On

The pattern you're trying to implement is called EAV (entity-attribute-value).

It has its uses, mainly for multi-tenant databases where the set of entities and their attributes can be set up and changed at run-time.

For instance, on an online shopping platform, a vendor can set up a set on attributes just for the washing machines (load type, number of cycles, energy efficiency etc.), and add and remove the attributes as they see fit, without changing the database layout.

It's a pain to index and query data laid out like that, so this kind of model is a necessary evil for certain use cases, of which yours isn't one.

I think design recommendations are off-topic for this side, as they are naturally too wide and opinion-based. So I'll point out a couple of more concrete problems with your design as this is something can be objectively addressed.

Hot partitions

Putting a hardcoded resource type id as a partition key for your GSI will make all the records go into a single partition, which will become a "hot partition" in DynamoDB parlance.

Every time you add, say, a script revision, Dynamo will have to add an entry to the GSI on (type, parent), which will go to the single partition storing the PK value script_partition

DynamoDB recommendations explicitly spell it out:

Status code, where there are only a few possible status codes: Bad

Lack of natural ordering

Your revisions are keyed as rev_1, rev_2 and rev_latest. rev_10, in the sorting order, will come between rev_1 and rev_2. You will either to always pull all the revisions and sort them on the client, or add and index an extra field with the revision date, filled out only for the revisions. If you do the latter, why not do all the way and just create a separate table?

Extra updates

Every time you add a new revision, you will need to update the previous rev_latest to rev_xxx, and you will only know what xxx is after a full sort (or querying a secondary index). This will also be prone to race conditions.