I have the journal and data on the same volume for a MongoDB shard, so the consistency problem that fsyncLock solves (locking before taking a snapshot) doesn't apply: an EBS snapshot would be a consistent point in time for a single shard.
I would like to know the preferred way of taking backups in a MongoDB sharded cluster. I have explored two options:
- Approximate point-in-time consistent backup by taking the EBS snapshots for all shards around the same time. The advantage is that no write lock needs to be taken.
- Stop writes on the system, then take the snapshots. This would give a point-in-time consistent backup.
Now, I'd like to know how this is actually done in production. I've read about using a replica set's secondary node for backups, but it's not clear to me how that gives a point-in-time consistent backup. Unless all the secondaries have data for the same point in time, the EBS snapshots cannot be point-in-time consistent. For example, what if the secondary for shard A is fully synced with its primary, but the secondary for shard B is lagging behind? Am I missing something here?
Also, can it ever happen that approach 1 leads to an inconsistent MongoDB cluster when restored, such that it crashes or otherwise misbehaves?
Consistent backups
The first steps in any sharded cluster backup procedure should be to:
- Stop the balancer (including waiting for any in-progress migrations to complete). Usually this is done with the `sh.stopBalancer()` shell helper.
- Back up a config server (usually with the same method as your shard servers, so an EBS or filesystem snapshot).
I would define a consistent backup of a sharded cluster as one where the sharded cluster metadata (i.e. the data stored on your config servers) corresponds with the backups for the individual shards, and each of the individual shards has been correctly backed up. Stopping the balancer ensures that no data migrations happen while your backup is underway.
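If you are scripting the backup rather than running the shell helper by hand, the same step can be driven from a driver. Here's a minimal sketch using pymongo against a mongos (the hostname is a placeholder; on MongoDB 3.4+ the `balancerStop` and `balancerStatus` commands are roughly what `sh.stopBalancer()` and `sh.isBalancerRunning()` wrap):

```python
# Sketch: stop the balancer before a sharded-cluster backup.
# Assumes pymongo and a mongos reachable at the hypothetical address below.
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.net:27017")

# Disable the balancer; the server waits for the current balancing round to finish.
mongos.admin.command("balancerStop")

# Double-check that balancing is off and no round is in progress before snapshotting.
status = mongos.admin.command("balancerStatus")
assert status["mode"] == "off" and not status["inBalancerRound"]
```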
Assuming your MongoDB data and journal files are on a single volume, you can take a consistent EBS snapshot or filesystem snapshot without stopping writes to the node you are backing up. Snapshots occur asynchronously. Once an initial snapshot has been created, successive snapshots are incremental (only needing to update blocks that have changed since the previous snapshot).
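For example, with the AWS SDK the snapshot request returns as soon as the snapshot is registered, the data is captured as of that moment, and the copy completes in the background. A rough sketch (boto3 with credentials configured; the region, volume ID, and description are placeholders):

```python
# Sketch: kick off an EBS snapshot of the data+journal volume for one shard member.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",        # hypothetical dbpath volume
    Description="mongodb shard rs0 backup",
)
# The call returns immediately; the snapshot finishes asynchronously.
print(snapshot["SnapshotId"], snapshot["State"])
```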
Point-in-time backup
With an active sharded cluster, you can only easily capture a true point-in-time backup of data that has been written by stopping all writes to the cluster and backing up the primaries for each shard. Otherwise, as you have surmised, there may be differing replication lag between shards if you back up from secondaries. It's more common to back up from secondaries as there is some I/O overhead while the snapshots are written.
If you aren't using replication for your shards (or prefer to back up from primaries) the replication lag caveat doesn't apply, but the timing will still be approximate for an active system as the snapshots need to be started simultaneously across all shards.
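One way to script the "stop writes, then snapshot" approach is sketched below. As noted above, fsyncLock isn't strictly needed for per-node snapshot consistency when data and journal share a volume; here it is only used to block writes on each shard's primary so the snapshots line up at the same logical point in time. Hostnames, volume IDs, and region are placeholders, and it assumes pymongo, boto3, and MongoDB 3.2+ for the fsyncUnlock command:

```python
# Sketch: stop writes across shards, start snapshots, then resume.
import boto3
from pymongo import MongoClient

ec2 = boto3.client("ec2", region_name="us-east-1")

# One entry per shard: the primary to lock and its EBS data volume.
shards = [
    ("shard0-primary.example.net:27017", "vol-0aaaaaaaaaaaaaaaa"),
    ("shard1-primary.example.net:27017", "vol-0bbbbbbbbbbbbbbbb"),
]
clients = [MongoClient(f"mongodb://{host}", directConnection=True) for host, _ in shards]

try:
    # Flush and block writes on every shard primary so all snapshots
    # capture the same logical point in time.
    for client in clients:
        client.admin.command("fsync", lock=True)

    # Start all snapshots; they complete asynchronously after the unlock.
    for _, volume_id in shards:
        ec2.create_snapshot(VolumeId=volume_id, Description="sharded cluster backup")
finally:
    # Always release the locks, even if a snapshot request fails.
    for client in clients:
        client.admin.command("fsyncUnlock")
```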
Point-in-time restore
Assuming all of your shards are backed by replica sets it is possible to use an approximate point-in-time consistent backup to orchestrate a restore to a more specific point-in-time using the replica set oplog for each of the shards (plus a config server). This is essentially the approach taken by backup solutions such as MongoDB Cloud Manager (née MMS): see MongoDB Backup for Sharded Cluster. MongoDB Cloud Manager leverages backup agents on each shard for continuous backup using the replication oplog, and periodically creates full snapshots on a schedule. Point-in-time restores can be built by starting from a full data snapshot and then replaying the relevant oplogs up to a requested point-in-time.
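If you are doing this by hand rather than with Cloud Manager, one simplified way to roll a restored snapshot forward is to replay a dump of the oplog with mongorestore's --oplogReplay and --oplogLimit options, stopping at the target timestamp. A hedged sketch, with the host, path, and timestamp as placeholders and the surrounding oplog dump/rename steps omitted:

```python
# Sketch: replay oplog entries onto a restored shard up to a target timestamp.
import subprocess

TARGET_TS = "1389736998:1"   # <seconds since epoch>:<ordinal> to stop at

subprocess.run(
    [
        "mongorestore",
        "--host", "shard0-restore.example.net:27017",
        "--oplogReplay",             # apply the oplog.bson found in the dump directory
        "--oplogLimit", TARGET_TS,   # ...but only entries before this timestamp
        "/backups/shard0/oplog-dump",
    ],
    check=True,
)
```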
What's the common production approach?
Downtime is generally not a desirable backup strategy for a production system, so the common approach is to take a consistent backup of a running sharded cluster at an approximate point-in-time using snapshots. Coordinating backup across a sharded cluster can be challenging, so backup tools/services are also worth considering. Backup services can also be more suitable if your deployment doesn't allow snapshots (for example, if your data and/or journal directories are spread across multiple volumes to maximise available IOPS).
Note: you should really, really consider using replication for your production deployment unless this is a non-essential cluster or downtime is acceptable. Replica sets help maximise uptime & availability for your deployment, and some maintenance tasks (including backup) will be much more disruptive without data redundancy.