Elassandra data modeling: When to create a secondary index and when not to


I'm not an absolute expert on Cassandra, but as far as I know (correct me if I'm wrong), creating a secondary index on every field of a data model is an anti-pattern.

I'm using Elassandra and my data model looks like this:

  • A users object that represents a user, with: userID, name, phone, e-mail, and all kinds of info about the user (say these users are selling things)
  • A sales object that represents a sale made by a user, with: saleID, userID, product name, price, etc. (there can be many more fields)

Given that I want to make complex searches on users (search by phone, search by e-mail, etc.), but only on name, e-mail and phone, is it a good idea to create the following 3 tables from this data model (a rough sketch follows the list):

  • "User core" table with only userID, name, phone and e-mail (fields for search) [Table fully indexed and mapped in Elasticsearch]
  • "User info" table with userID + the other infos [Table not indexed or mapped in Elasticsearch]
  • "Sales" table with userID, saleID, product name, price, etc. [Table not indexed or mapped in Elasticsearch]

I see at least one advantage: any indexing (or reindexing when changes happen) and its associated cost only occurs when the "User core" table changes, which should not be too frequent. Also, if I need the other data (the remaining user info or the sales), I can just make 2 queries: one against "User core" to get the userID, and one against the other table (using that userID) to get the rest.
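
For illustration, that two-query read could look roughly like this, assuming the "User core" table is mapped to an Elasticsearch index (here called shop_users) and that the document _id carries the Cassandra primary key, which is how Elassandra maps rows to documents; hosts and names are placeholders:

```python
# Rough sketch of the 2-query pattern: Elasticsearch resolves the search field
# to a userID, then Cassandra serves the rest by primary key.
import uuid
from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

session = Cluster(["127.0.0.1"]).connect("shop")
es = Elasticsearch(["http://127.0.0.1:9200"])

def sales_for_email(email):
    # Query 1: search the small, indexed "user core" data by e-mail.
    hits = es.search(index="shop_users",
                     body={"query": {"match": {"email": email}}})["hits"]["hits"]
    if not hits:
        return []
    # In Elassandra the document _id is derived from the Cassandra primary key.
    user_id = uuid.UUID(hits[0]["_id"])

    # Query 2: single-partition read on the non-indexed table using that key.
    return list(session.execute(
        "SELECT sale_id, product_name, price FROM sales_by_user WHERE user_id = %s",
        (user_id,)))
```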

But I'm not sure this is a good pattern; maybe I shouldn't worry about secondary indexing and just index the other tables as well?

To sum up: what are the key reasons to choose a secondary index like Elasticsearch in Elassandra, versus denormalizing tables and relying on partition and clustering keys?

Please feel free to ask if you need more examples on my use case.


There are 2 answers

Erick Ramirez

You should not normalise the tables when you're using Cassandra. The most important aspect of data modelling for Cassandra is to design one table for each application query. To put it another way, you should always denormalise your tables.
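
As a minimal illustration of "one table per application query" in pure Cassandra: the same user data is written under two different partition keys, one table per access pattern (names and columns are just an example):

```python
# Illustrative "one table per application query" layout in pure Cassandra:
# the same user data is duplicated, once per access pattern.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

# Query: "find a user by user_id"
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_id (
        user_id uuid PRIMARY KEY,
        name text, phone text, email text
    )""")

# Query: "find a user by e-mail" -- same data, different partition key.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_email (
        email text PRIMARY KEY,
        user_id uuid, name text, phone text
    )""")

# Every write goes to both tables; a logged batch keeps them in step.
session.execute("""
    BEGIN BATCH
        INSERT INTO users_by_id (user_id, name, phone, email)
            VALUES (%(id)s, %(name)s, %(phone)s, %(email)s);
        INSERT INTO users_by_email (email, user_id, name, phone)
            VALUES (%(email)s, %(id)s, %(name)s, %(phone)s);
    APPLY BATCH""",
    {"id": uuid.uuid4(), "name": "Jane", "phone": "+123456789", "email": "jane@example.com"})
```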

After you've modelled a table for each query, index with Elassandra the table that contains the most columns you need to search on.
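
If I remember the Elassandra mapping API correctly, indexing only that table looks roughly like the sketch below: the index is backed by the Cassandra keyspace, the type name matches the table name, and the discover regex limits the mapping to the searchable columns. Index, keyspace and column names are placeholders, and the exact mapping keys should be checked against your Elassandra version.

```python
# Rough sketch of mapping a single Cassandra table in Elassandra. The
# "keyspace" setting and per-table "discover" regex follow the Elassandra
# docs as I recall them -- verify against the version you run.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://127.0.0.1:9200"])

es.indices.create(index="shop_users", body={
    "settings": {"keyspace": "shop"},           # back the index with this keyspace
    "mappings": {
        "user_core": {                          # type name = Cassandra table name
            "discover": "^(name|phone|email)$"  # map only the searchable columns
        }
    }
})
```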

It's important to note that Elassandra is not a magic bullet. In a lot of cases, you do not need to index the tables if you have modelled them against your application queries correctly.

The use case for Elassandra is to take advantage of features such as free-form text search, faceting, boosting, etc., but it will not be as performant as a native table: index lookups require more "steps" than a straightforward single-partition Cassandra read. Of course, YMMV depending on your use case and access patterns. Cheers!

Alex Tbk

I don't think Erick's answer is fully correct in the case of Elassandra. It is true that native Cassandra queries will outperform Elasticsearch, and that in pure Cassandra you should model your tables around the queries.

But if you prefer flexibility over performance (and this is mainly why you choose Elassandra), you can use Cassandra as the primary storage, benefit from Cassandra's replication performance, and index the tables you need to search in Elasticsearch.

This keeps you flexible on the search side while making sure you don't lose data if something goes wrong on the Elasticsearch side.

In fact, in production we use a combination of both: tables have their partition/clustering keys and are indexed in Elasticsearch (when necessary). In the backend you can then decide whether a query can be served by the Cassandra keys or whether Elasticsearch is required.
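
As a rough sketch of that routing decision (table, index and host names are placeholders): if the caller already has the partition key, read Cassandra directly; otherwise fall back to Elasticsearch.

```python
# Sketch of the backend routing decision: query by Cassandra keys when the
# partition key is known, otherwise go through Elasticsearch.
from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

session = Cluster(["127.0.0.1"]).connect("shop")
es = Elasticsearch(["http://127.0.0.1:9200"])

def find_users(user_id=None, free_text=None):
    if user_id is not None:
        # Key is known: a single-partition read, no index involved.
        rows = session.execute(
            "SELECT user_id, name, phone, email FROM user_core WHERE user_id = %s",
            (user_id,))
        return [row._asdict() for row in rows]

    # Key unknown: let Elasticsearch resolve the search across the mapped fields.
    hits = es.search(index="shop_users",
                     body={"query": {"multi_match": {
                         "query": free_text,
                         "fields": ["name", "phone", "email"]}}})
    return [h["_source"] for h in hits["hits"]["hits"]]
```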