I have a scheduled task that generates around 50 transfers per day per user. This is an example transfer that is stored in a transfers collection in MongoDB:

transfer = {
  _id: 'ObjectId("...")',
  transferScheduleId: 'ObjectId("...")',
  TargetA: 'ObjectId("...")',
  TargetB: 'ObjectId("...")',     <-- optional
  SourceA: 'SourceId("...")',
  SourceB: 'SourceId("...")',     <-- optional
  value: 1234.5678,               <-- can be variable or constant
  date: '1900-01-01'
}

I'm still learning NoSQL/MongoDB, so I want to understand the pros and cons of the different ways of storing this data, and I'm looking for feedback and ideas.

Goal & Setup: Since my app is meant to scale to more users, I want to store transfers efficiently to reduce cost. Transfers are generated in a web app and then stored in MongoDB. This data is accessed regularly: when a user logs on, the transfer array is sent to the web app via a Node.js API using Mongoose.

My question is about how to store each transfer: in its own collection or as embedded documents. Given MongoDB's 16 MB BSON document size limit, my first conclusion was to store transfer documents in their own collection.
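
For reference, this is a minimal sketch of what that collection-based approach might look like with Mongoose. The schema and model names are my assumptions, and I'm treating the Source fields as ObjectIds here even though my example above shows them as SourceId:

const mongoose = require('mongoose');

const transferSchema = new mongoose.Schema({
  // index so transfers can be looked up by their schedule efficiently
  transferScheduleId: { type: mongoose.Schema.Types.ObjectId, ref: 'TransferSchedule', index: true },
  TargetA: { type: mongoose.Schema.Types.ObjectId, required: true },
  TargetB: { type: mongoose.Schema.Types.ObjectId },   // optional
  SourceA: { type: mongoose.Schema.Types.ObjectId, required: true },
  SourceB: { type: mongoose.Schema.Types.ObjectId },   // optional
  value: Number,
  date: Date
});

const Transfer = mongoose.model('Transfer', transferSchema);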

Furthermore, a transferSchedule collection is used to configure transfers and has the following properties:

transferSchedule = {
  _id: 'ObjectId("...")',
  userId: 'ObjectId("...")',
  endDate: '1900-01-01' <-- optional
  startDate: '1900-01-01' 
}

I would rather embed the transfers in a transferSchedule document, because I'm assuming it would be faster to fetch, since a user will have up to 50 transferSchedules on his/her profile. I'm also assuming that I could compress the data in an embedded document, but more about that in the next paragraph. Are my assumptions right? Is it faster to fetch 50 transferSchedule documents by userId, each with an embedded array of transfers? Or is it faster to fetch all transfers from a collection and look each one up by transferScheduleId?
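
To make the comparison concrete, these are the two access patterns I have in mind. This is only a sketch, assuming TransferSchedule and Transfer Mongoose models like the ones above, and that it runs inside an async function:

// Option 1: embedded transfers – a single query by userId returns
// up to ~50 schedule documents, each with its transfers array inside.
const embedded = await TransferSchedule.find({ userId }).lean();

// Option 2: separate transfers collection – fetch the schedules first,
// then all their transfers with one $in query on transferScheduleId
// (which should be indexed).
const schedules = await TransferSchedule.find({ userId }).lean();
const transfers = await Transfer.find({
  transferScheduleId: { $in: schedules.map(s => s._id) }
}).lean();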

OK, so with embedding there is still the problem of the 16 MB document size limit. To reduce the size of the array, I was thinking of either a) compressing the data in the array, but that would require fetching the full compressed array for every user and transferSchedule, decompressing it, adding the new transfer, recompressing it and writing it back to the database; or b) creating a custom compression scheme that stores only the properties needed to recreate the transfer array in its entirety. The scheme would store only the start date and end date of a run of executed transfers whose properties don't change during that period. Whenever the value, source or target changes, a new entry is created. Here is what the stored document would look like, followed by a sketch of the update logic:

transferSchedule = {
  _id: 'ObjectId("...")',
  userId: 'ObjectId("...")',
  endDate: null                      <-- optional
  startDate: '2019-01-01' 
  transfers: [
    {
      startDate: '2019-01-01',
      endDate: '2019-03-26',        <-- inclusive
      value: 100
      SourceA: 'SourceId("abc123")',
      TargetA: 'ObjectId("def456")',
    },
    {
      startDate: '2019-03-27',
      endDate: '2019-05-03',        <-- current date of most recent transfer
      value: 2000                   <-- change in data between March 26 & 27 
      SourceA: 'SourceId("abc123")',
      TargetA: 'ObjectId("def456")',
    },
  ]
}
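
With approach b), the daily job would only need to either extend the last entry or append a new one, which keeps writes small. A rough sketch of that update logic, with names and the comparison helper being my assumptions (SourceB/TargetB left out for brevity):

// Called once per day per schedule with the newly generated transfer.
async function recordTransfer(scheduleId, transfer) {
  const schedule = await TransferSchedule.findById(scheduleId);
  const last = schedule.transfers[schedule.transfers.length - 1];

  const unchanged = last &&
    last.value === transfer.value &&
    String(last.SourceA) === String(transfer.SourceA) &&
    String(last.TargetA) === String(transfer.TargetA);

  if (unchanged) {
    // Same value/source/target as the current run: just move the end date.
    last.endDate = transfer.date;
  } else {
    // Something changed: start a new run.
    schedule.transfers.push({
      startDate: transfer.date,
      endDate: transfer.date,
      value: transfer.value,
      SourceA: transfer.SourceA,
      TargetA: transfer.TargetA
    });
  }
  await schedule.save();
}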

When the document is loaded in the web app, the full list of daily transfers can be reconstructed from the transfers array above.
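
The reconstruction on the client (or in the API) could look roughly like this; again a sketch, assuming one transfer per day per schedule and an inclusive endDate:

// Expand one run-length entry back into individual daily transfers.
function expandRun(run) {
  const transfers = [];
  const end = new Date(run.endDate);
  for (let d = new Date(run.startDate); d <= end; d.setDate(d.getDate() + 1)) {
    transfers.push({
      date: new Date(d),
      value: run.value,
      SourceA: run.SourceA,
      TargetA: run.TargetA
    });
  }
  return transfers;
}

// Rebuild the full array for a schedule.
const allTransfers = schedule.transfers.flatMap(expandRun);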

I'm not sure if I'm overthinking this. Coming from a SQL background, I'm still unsure about the scalability and performance impact of collections, also with regard to CRUD operations. Are any of my approaches valid? I would appreciate any feedback, tips and ideas. Thank you!
