Differences in NoSQL databases and the likelihood of inconsistency problems

I work at a large company whose backend systems are struggling with load. The company wants to replace its legacy system and database with a horizontally scalable NoSQL database, so that it is ready for future growth.

Distributed NoSQL databases generally offer only eventual consistency. How much of a problem that is has yet to be investigated. In this case we are dealing with a system that has relatively few writes and many reads, and where availability is important.

There are quite a few NoSQL database systems (Cassandra, MongoDB, HBase, etc.). Are there any guidelines, or is there literature available, on which database systems suit which cases? I'd also like to get an idea of how likely inconsistency problems are to occur, how to reduce that likelihood, and at what cost.

Any information/tips/references to literature are welcome.

1 answer

ashic (best answer)

There's tonnes of information out there...Google is your friend :)

I can highly recommend Cassandra. It's fairly easy to set up, and is masterless and fault tolerant. You specify how much replication you want per keyspace (roughly Cassandra's equivalent of a database), and it handles that for you. It can also replicate across data centres. It has tunable consistency: if you want, for certain bits of data, you can achieve full consistency at the cost of availability (a write can fail while a required replica is down, for example). So it's not necessarily an all-or-nothing scenario. It has the notion of a schema, and you store data in tables as rows, with primary keys. It has a query language, CQL, that looks quite similar to SQL (but is a lot more limited). Familiarity, schema, performance, tunable consistency... quite a nice combination.
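To make the tunable consistency trade-off concrete, here is a minimal sketch of the usual overlap rule: a read is guaranteed to touch at least one replica that holds the latest acknowledged write when R + W > N, where N is the replication factor and R and W are the number of replicas a read and a write must reach. The function names are made up for illustration, and the model assumes a single data centre and ignores details such as repair and hinted handoff.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge an operation at a consistency level.

    Only a handful of Cassandra's consistency levels are modelled here.
    """
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


def tolerated_failures(level: str, replication_factor: int) -> int:
    """Replicas that can be down while the operation still succeeds."""
    return replication_factor - replicas_required(level, replication_factor)
```

With a replication factor of 3, writing and reading at QUORUM satisfies the rule and each operation survives one replica being down. Writing at ALL and reading at ONE also satisfies it and keeps reads cheap, which may suit a read-heavy system, but writes then tolerate no replica failures. Writing and reading at ONE is the most available option and is the one where stale reads become possible.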

There are drawbacks. There are NO joins. As such, you have to focus a bit more on data modelling, and on knowing the types of queries you'll need for real-time work. The conceptual data model will likely differ from the actual physical data model: some conceptual data will likely exist as copies in several denormalised physical tables. This gives very fast reads, but you do need to understand your data a bit.
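As a rough illustration of "one table per query", the toy below keeps two in-memory copies of the same conceptual order data, each keyed for a different lookup. It is plain Python rather than Cassandra, and the `OrderStore` class and its field names are hypothetical; the point is only that the write path does extra work so that each read is a single key lookup with no join.

```python
from collections import defaultdict


class OrderStore:
    """Toy model of denormalised tables: same data, one copy per query."""

    def __init__(self):
        # Each dict stands in for a table partitioned by a different key.
        self.orders_by_customer = defaultdict(list)
        self.orders_by_product = defaultdict(list)

    def record_order(self, order_id, customer_id, product_id, quantity):
        """Write one conceptual order into both query-specific tables."""
        row = {
            "order_id": order_id,
            "customer_id": customer_id,
            "product_id": product_id,
            "quantity": quantity,
        }
        # Independent copies: the duplication is deliberate.
        self.orders_by_customer[customer_id].append(dict(row))
        self.orders_by_product[product_id].append(dict(row))

    def orders_for_customer(self, customer_id):
        """Answer 'all orders for this customer' with one key lookup."""
        return list(self.orders_by_customer.get(customer_id, []))

    def orders_for_product(self, product_id):
        """Answer 'all orders for this product' with one key lookup."""
        return list(self.orders_by_product.get(product_id, []))
```

The cost of this layout is that every update has to touch every copy, which is why you need to know your queries up front.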

For analytical queries, you would usually pair it up with Spark. This will allow you to query over your data set, much like Hadoop. The queries are slower than the real time stuff, but can give a good balance of total data volume and querying flexibility.

Cassandra by itself will NOT be a full text search engine. However, it's not uncommon to pair it up with Lucene or Solr to provide search capabilities.

In terms of use cases, Cassandra can be used in many forms. At its simplest, it's a key-value store where each value is a collection of ordered key-value pairs. The top-level key gives you partitions (shards) of data, and the ordering within each partition lets you store time series data very efficiently. The "values" support collection columns (sets, maps, and lists) as well, and you can have "exact match indexes" on these, which allows slightly more flexible querying.

These features mean that Cassandra can be used for a wide variety of use cases, but obviously not all. It really depends on what use case you're trying to solve. There's no single "best NoSQL" database out there. Each data store tends to have a set of use cases, and it's hard to list all the mappings. Instead, you'd have to see what your use cases are, see which store's features overlap the most, and then pick one or possibly more.
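The "key-value store where each value is a collection of ordered key-value pairs" idea can be sketched in a few lines of plain Python. This is only a mental model, not how Cassandra is implemented: `WideRowStore` is a hypothetical name, the outer key plays the role of the partition key, and the sorted inner key plays the role of a clustering column such as a timestamp.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._keys = defaultdict(list)    # partition -> sorted clustering keys
        self._values = defaultdict(dict)  # partition -> {clustering key: value}

    def put(self, partition_key, clustering_key, value):
        """Insert a row, or overwrite the row with the same full key."""
        values = self._values[partition_key]
        if clustering_key not in values:
            bisect.insort(self._keys[partition_key], clustering_key)
        values[clustering_key] = value

    def range(self, partition_key, start, end):
        """Rows with start <= clustering key < end, in clustering order."""
        keys = self._keys.get(partition_key, [])
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        values = self._values.get(partition_key, {})
        return [(k, values[k]) for k in keys[lo:hi]]
```

With a sensor id as the partition key and a timestamp as the clustering key, "readings for sensor X between t1 and t2" becomes one partition lookup followed by a contiguous slice, which is why this shape suits time series.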