DynamoDB one-to-one

175 views Asked by At

Hello stackoverflow community,

This question is about modeling one-to-one relationships with multiple entities involved.

Say we have an application about students. Each Student has:

  • Profile (name, birth date...)
  • Grades (math score, geography...)
  • Address (city, street...).

Requirements:

  1. The Profile, Grades and the Address only belong to one Student each time (i.e. one-to-one).
  2. A Student has to have all Profile, Grades and Address data present (there is no student without grades for example).
  3. Updates can happen to all fields, but the profile data mostly remain untouched.
  4. We access the data based on a Student and not by querying for the address or something else (a query could be "give me the grades of student John", or "give me profile and address of student John", etc).
  5. All fields put together are bellow the 400kb threshold of DynamoDB.

The question is how would you design it? Put all data as a single row/item or split it to Profile, Grades and Address items?

2

There are 2 answers

0
chris litsas On

My solution is to go with keeping all data in one row defined by the studentId as the PK and the rest of the data follow in a big set of columns. So one item looks like [studentId, name, birthDate, mathsGrade, geographyGrade, ..., city, street].

I find that like this I can have transnational inserts/updates (with the downside that I always have to work with the full item of course) and while querying I can ask for the subset of data needed each time. On top of the above, this solution fits with two of the most important AWS guidelines about dynamo:

  1. keep everything in a single table and
  2. pre-join data whenever possible.

The reason for my question is that I could only find one topic in stackoverflow about one-to-one modeling in DynamoDB and the suggested solution (also heavily up-voted) was in favor of keeping the data in separate tables, something that reminds me a relational-DB kind of design (see the solution here).

I understand that in that context the author tried to keep a more generic use case and probably support more complex queries, but it feels like the option of putting everything together was fully devalued.

For that reason I'd like to open that discussion here and listen to other opinions.

0
bd_ On

A Basic Implementation

Considering the data and access patterns you've described, I would set up a single student-data table with a partition key that allows me to query by the student, and a sort key that allows me to narrow down my results even further based on the entity I want to access. One way of doing that would be to use some kind of identifier for a student, say studentID, and then something more generalized for the sort key like entityID, or simply SK.

At the application layer, I would classify each Item under one possible entity (profile, grades, address) and store data relevant to that entity in any number of attributes that I would need on that Item.

An example of how that data might look for a student named john smith:

{ studentId: "john", entityId: "profile", firstName: "john", lastName: "smith" }

{ studentId: "john", entityId: "grades", math2045: 96.52, eng1021:89.93 }

{ studentId: "john", entityId: "address", state: "CA", city: "fresno" }

With this schema, all your access patterns are available:

"give me the math grades of student john"

PartitionKey = "john", SortKey = "grades"

and if you store address within the students profile entity, you can accomplish "give me profile and address of student John" in one shot (multiple queries should be avoided when possible)

PartitionKey = "john", SortKey = "profile"

Consider

Keep in mind, you need to take into account how frequently you are reading/writing data when designing your table. This is a very rudimentary design, and may need tweaking to ensure that you're not setting yourself up for major cost or performance issues down the road.

The basic idea that this implementation demonstrates is that denormalizing your data (in this case, across the different entities you've established) can be a very powerful way to leverage DynamoDB's speed, and also leave yourself with plenty of ways to access your data efficiently.

Problems & Limitations

Specific to your application, there is one potential problem that stands out, which is that it seems very feasible the grades Items start to balloon to the point where they are impossible to manage and become expensive to read/write/update. As you start storing more and more students, and each student takes more and more courses, your grades entities will expand with them. Say the average student takes anywhere from 35-40 classes and gets a grade for each of them, you don't want to have to manage 35-40 attributes on an item if you don't have to. You also may not want back every single grade every time you ask for a student's grades. Maybe you start storing more data on each grade entity like:

{ math1024Grade: 100, math1024Instructor: "Dr. Jane Doe", math1024Credits: 4 }

Now for each class, you're storing at least 2 extra attributes. That Item with 35-40 attributes just jumped up to 105-120 attributes.

On top of performance and cost issues, your access patterns could start to evolve and become more demanding. You may only want grades from the student's major, or a certain type of class like humanities, sciences, etc, which is currently unavailable. You will only ever be able to get every single grade from each student. You can apply a FilterExpression to your request and remove some of the unwanted Items, but you're still paying for all the data you've read.

With the current solution, we are leaving a lot on the table in terms of optimizations in performance, flexibility, maintainability, and cost.

Optimizations

One way to address the lack of flexibility in your queries, and possible bloating of grades entities, is the concept of a composite sort key. Using a composite sort key can help you break down your entities even further, making them more manageable to update and providing you more flexibility when you're querying. Additionally, you would wind up with much smaller and more manageable items, and although the number of items you store would increase, you'll save on cost and performance. With more optimized queries, you'll get only the data you need back so you're not paying those extra read units for data you're throwing away. The amount of data a single Query request can return is limited as well, so you may cut down on the amount of roundtrips you are making.

That composite sort key could look something like this, for grades:

{ studentId: "john", entityId: "grades#MATH", math2045: 96.52, math3082:91.34 }

{ studentId: "john", entityId: "grades#ENG", eng1021:89.93, eng2203:93.03 }

Now, you get the ability to say "give me all of John's MATH course grades" while still being able to get all the grades (by using the begins_with operation on the sort key when querying).

If you think you'll want to start storing more course information under grades entities, you can suffix your composite sort key with the course name, number, identifier, etc. Now you can get all of a students grades, all of a students grades within a subject, and all that data about a students grade within a subject, like its instructor, credits, year taken, semester, start date, etc.

These optimizations are all possible solutions, but may not fit your application, so again keep that in mind.

Resources

Here are some resources that should help you come up with your own solution, or ways to tweak the ones I've provided above to better suit you.

AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304)

AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)

Best Practices for Using Sort Keys to Organize Data

NoSQL Design For DynamoDB

And keep this one in mind especially when you are considering cost/performance implications for high-traffic application:

Best Practices for Designing and Using Partition Keys Effectively