Multiple mongodb servers seen as one and data flow management


For my application I need to periodically move old data from one MongoDB server to another (i.e., two distinct servers). I also want to be able to query that data as if it all lived in the same database.

In short, I want to treat two MongoDB instances (on two different servers) as one, while controlling when and where the data is stored.

I read about sharding and chunks, and quickly came across the moveChunk function, which could easily do what I want.

The problem is that it seems impossible to configure such an architecture in MongoDB. Am I missing something here?

1 answer

Answer by bagrat

Archiving Deleted Documents

To keep deleted documents, you cannot rely on built-in mechanisms such as sharding or replication. You have to handle this case manually, e.g. by keeping a separate collection for deleted documents and moving documents into that collection instead of deleting them.
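The manual approach above can be sketched in Python. This is only an illustration: the function name archive_documents is hypothetical, and the two arguments are assumed to behave like pymongo Collection objects (find, insert_many, delete_many). The two collections may live on the same server or on two different ones.

```python
from typing import Any, Mapping


def archive_documents(source: Any, archive: Any, query: Mapping[str, Any]) -> int:
    """Copy documents matching `query` into `archive`, then delete them from `source`.

    Both arguments are assumed to expose the pymongo Collection methods
    find, insert_many and delete_many. Returns the number of deleted documents.
    """
    docs = list(source.find(query))
    if not docs:
        # pymongo's insert_many rejects an empty list, so stop early.
        return 0

    # Copy first, delete second: a failure in between leaves duplicates
    # in the archive rather than losing data.
    archive.insert_many(docs)

    # Delete exactly the documents that were copied, by _id.
    ids = [doc["_id"] for doc in docs]
    result = source.delete_many({"_id": {"$in": ids}})
    return result.deleted_count


# Usage with pymongo (host, database and collection names are placeholders):
#   from pymongo import MongoClient
#   live = MongoClient("mongodb://server-a:27017")["app"]["events"]
#   old = MongoClient("mongodb://server-b:27017")["app"]["events_archive"]
#   archive_documents(live, old, {"deleted": True})
```

The copy and the delete are two separate operations, so this is not atomic. If the process stops between them, a later run may hit duplicate _id errors in the archive and would need to handle them.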


For your broader problem of moving data, you have the following two options:

Sharding

With sharding, your data is split into chunks that are stored on (in your case) two different servers. In this scenario you can use the moveChunk command, as you mentioned. It is tricky, though: to keep full manual control over your chunks you need to disable the built-in automatic balancer. In any case, the MongoDB documentation advises against this:

Only use the moveChunk in special circumstances such as preparing your sharded cluster for an initial ingestion of data, or a large bulk import operation. In most cases allow the balancer to create and balance chunks in sharded clusters

Besides, sharding only lets you split the data, so to reach your goal you would end up with one full shard and one empty shard.
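For reference, the manual route described above comes down to two admin commands sent to a mongos: one that stops the balancer and one that moves a chunk. Here is a minimal Python sketch of the command documents. The helper names are hypothetical, and so are the namespace app.events, the shard key created_at and the shard name shard0001. It also assumes a MongoDB version that provides the balancerStop command.

```python
from typing import Any, Dict, Mapping


def balancer_stop_command() -> Dict[str, Any]:
    """Admin command document that disables the automatic balancer."""
    return {"balancerStop": 1}


def move_chunk_command(
    namespace: str, shard_key_query: Mapping[str, Any], target_shard: str
) -> Dict[str, Any]:
    """Admin command document that moves the chunk containing `shard_key_query`.

    The command name has to be the first key; Python dicts keep insertion order.
    """
    return {
        "moveChunk": namespace,          # "<database>.<collection>"
        "find": dict(shard_key_query),   # any shard key value inside the chunk
        "to": target_shard,              # name of the destination shard
    }


# Usage with pymongo against a mongos (all names are placeholders):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://mongos-host:27017")
#   client.admin.command(balancer_stop_command())
#   client.admin.command(move_chunk_command("app.events", {"created_at": 0}, "shard0001"))
```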


Replication

The replication approach is much safer and easier to set up. You simply configure a replica set and add your second server to it.

If the data set is large, you can configure your second server as a hidden member, so that no reads are directed to it and clients never receive inconsistent data while it is still syncing. Once the replication has finished, you will have a copy of your data on both servers.
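As an illustration of the hidden-member setup, here is a small Python sketch that takes an existing replica set configuration (the document returned by the replSetGetConfig command) and returns a new one with a hidden member added. The function name add_hidden_member and the host names are hypothetical. A hidden member must have priority 0, and the sketch bumps the config version, which a raw replSetReconfig is generally expected to need.

```python
import copy
from typing import Any, Dict


def add_hidden_member(config: Dict[str, Any], host: str) -> Dict[str, Any]:
    """Return a copy of `config` with `host` added as a hidden, priority-0 member."""
    new_config = copy.deepcopy(config)  # leave the caller's config untouched

    # Member _id values must be unique within the replica set.
    next_id = max(member["_id"] for member in new_config["members"]) + 1
    new_config["members"].append(
        {"_id": next_id, "host": host, "priority": 0, "hidden": True}
    )

    # The new configuration needs a higher version than the current one.
    new_config["version"] += 1
    return new_config


# Usage with pymongo, connected to the primary (names are placeholders):
#   current = client.admin.command("replSetGetConfig")["config"]
#   client.admin.command("replSetReconfig", add_hidden_member(current, "server-b:27017"))
```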

As for using both servers as a single server: if you need to balance requests between the two, you can set your readPreference to secondary, which ensures that all reads are sent to the secondary server, while writes go to the primary by default.

In this case your code does not need to know which server it is querying. You just call your client methods, and the driver takes care of the rest.
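On the client side this usually comes down to the connection string. The sketch below builds one with the standard replicaSet and readPreference URI options; the helper name replica_set_uri, the host names and the replica set name rs0 are assumptions made for the example.

```python
from typing import Sequence
from urllib.parse import urlencode


def replica_set_uri(
    hosts: Sequence[str], replica_set: str, read_preference: str = "secondary"
) -> str:
    """Build a MongoDB connection string that lists all members of the set."""
    options = urlencode(
        {"replicaSet": replica_set, "readPreference": read_preference}
    )
    return "mongodb://" + ",".join(hosts) + "/?" + options


uri = replica_set_uri(["server-a:27017", "server-b:27017"], "rs0")

# Usage with pymongo: the driver then routes reads and writes on its own.
#   from pymongo import MongoClient
#   client = MongoClient(uri)
```

Note that a member still configured as hidden is not visible to clients, so it will not serve these reads until it is un-hidden. Reads from a secondary can also lag slightly behind the primary.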


Conclusion

So my advice is to use the replication approach, as it is the cleaner, safer and less painful solution.