How to maintain data consistency across multiple tables in cassandra?


I'm having trouble figuring out how to maintain attribute updates across multiple tables to ensure data consistency.

For example, suppose I have a many-to-many relationship between actors and fans: a fan can support many actors, and an actor can have many fans. I made several tables to support my queries:

CREATE TABLE fans (
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY ((fan_id))
);

CREATE TABLE actors (
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY ((actor_id))
);

CREATE TABLE actors_by_fan (
    fan_id uuid,
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY (fan_id, actor_id)
);

CREATE TABLE fans_by_actor (
    actor_id uuid,
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY (actor_id, fan_id)
);

Let's say I'm a fan and I'm on my settings page and I want to change my fan_attr_1 to a different value.

On the fans table I can update my attribute just fine since the application knows my fan_id and can key on that.

However, I cannot change my fan_attr_1 in the fans_by_actor table without first querying for the actor_ids tied to the fan.

This problem occurs any time you want to update any attribute of either fans or actors.

I've tried looking online for people experiencing similar problems, but I couldn't find any. For example, DataStax's data modeling course uses an example with actors and videos in a many-to-many relationship, with tables actors_by_video and videos_by_actor. The course, like the other online resources I've consulted, discusses modeling tables after queries, but doesn't dig into how to maintain data integrity. In the actors_by_video table, what would happen if I wanted to change an actor's attribute? Wouldn't I have to go through every row of actors_by_video to find the partitions that contain the actor and update the attribute? That sounds very inefficient. The other option is to look up the video ids beforehand, but I've read elsewhere that read-before-write is an antipattern in Cassandra.

What would be the best approach for tackling this problem either from a data modeling standpoint or from a CQL standpoint?

EDIT: fixed sentence stubs; added context and prior research.


There are 3 answers

Brad Schoening

Materialized views are probably the best choice. Note two CQL requirements: a view selects from a single base table and can only include columns that exist there (views cannot join, so the base table must already carry both ids), and every primary key column of the view must be restricted with IS NOT NULL:

CREATE MATERIALIZED VIEW actors_by_fan AS
    SELECT fan_id, actor_id, actor_attr_1, actor_attr_2
    FROM fans
    WHERE fan_id IS NOT NULL AND actor_id IS NOT NULL
    PRIMARY KEY (fan_id, actor_id);

CREATE MATERIALIZED VIEW fans_by_actor AS
    SELECT actor_id, fan_id, fan_attr_1, fan_attr_2
    FROM actors
    WHERE actor_id IS NOT NULL AND fan_id IS NOT NULL
    PRIMARY KEY (actor_id, fan_id);

In versions prior to 3.0, create secondary indexes and evaluate whether their performance is acceptable. After upgrading to 3.x, drop the secondary indexes and create materialized views instead.
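A sketch of the pre-3.0 alternative (the index name is illustrative): a secondary index on the clustering column fan_id of fans_by_actor lets the application find every row for a given fan without scanning the whole table.

```sql
-- Hypothetical pre-3.0 workaround: index the clustering column so the
-- application can locate every fans_by_actor row for a given fan.
CREATE INDEX fans_by_actor_fan_idx ON fans_by_actor (fan_id);

-- Find the partitions that must be rewritten when a fan attribute changes:
SELECT actor_id FROM fans_by_actor WHERE fan_id = ?;
```

Secondary index lookups fan out across the cluster, which is why it's worth measuring their performance before relying on them.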

Clement Amarnath

Data Modeling

Cassandra is not a relational database, and there are basic rules to follow when modeling data. At a high level, a data model should meet two goals:

1) Spread data evenly around the cluster

2) Minimize the number of partitions read

Moreover, prefer a single big table over breaking the data into multiple tables with relationships between them. This approach duplicates records, but duplication is not a costly operation in Cassandra: it consumes only a little more disk space, rather than CPU, memory, disk IOPS, or network.
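For the fan/actor example in the question, such a single wide table might look like the following sketch (the table name fan_actor is illustrative, not from the original question):

```sql
-- One wide relationship table holding both entities' attributes per pair.
CREATE TABLE fan_actor (
    fan_id uuid,
    actor_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY (fan_id, actor_id)
);
```

Queries by actor would still need their own table (or a materialized view over this one), and updating an attribute still touches one row per fan/actor pair.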

Please note that there are size restrictions on column keys and values: the maximum column key (and row key) size is 64 KB, and the maximum column value size is 2 GB. But because there is no streaming and the whole value is fetched into heap memory when requested, limit value sizes to a few MB.

More Info:

http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html

CQL

Maintaining consistency across tables can be done using batches or materialized views. Materialized views are available from version 3.0.

Please see

How to ensure data consistency in Cassandra on different tables?

My preference would be to change the data model, design it around the queries, and if possible use a single big table.

Hope it Helps!

xmas79

The way you solve this kind of problem is to manually update all the affected records.

Since you can't use materialized views, in order to update fan_attr_1 on your data you need to:

  1. Update the fans table by issuing UPDATE fans ... WHERE fan_id = xxx.
  2. Select all the actor_ids from actors_by_fan by issuing SELECT actor_id FROM actors_by_fan WHERE fan_id = xxx.
  3. Update all the corresponding rows in the fans_by_actor table by issuing UPDATE fans_by_actor ... WHERE actor_id IN (...) AND fan_id = xxx, or alternatively loop over the actor_ids and run each update asynchronously.

As long as step 2 returns a small number of actor_ids, say fewer than 20, you can group all the queries into a single logged BATCH and maintain strong consistency between the tables. Otherwise you need to guarantee consistency between the tables some other way.
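For illustration, if step 2 returned two actor ids, the batch combining steps 1 and 3 might look like this sketch (xxx, yyy, and zzz are placeholder uuids, and 42 a placeholder value):

```sql
BEGIN BATCH
    -- step 1: the fan's own row
    UPDATE fans SET fan_attr_1 = 42 WHERE fan_id = xxx;
    -- step 3: one statement per actor_id returned by step 2
    UPDATE fans_by_actor SET fan_attr_1 = 42 WHERE actor_id = yyy AND fan_id = xxx;
    UPDATE fans_by_actor SET fan_attr_1 = 42 WHERE actor_id = zzz AND fan_id = xxx;
APPLY BATCH;
```

A logged batch guarantees that all statements eventually apply, but it does not isolate readers from briefly observing some rows updated and others not.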

This is as inefficient as it sounds, but I don't think there is a smarter solution. By the way, you are issuing one read (step 2) and multiple writes (steps 1 and 3). This won't be the end of the world, especially if you don't change attributes extremely often (e.g. every 10 milliseconds).