How do I sort data by the last update date in Cassandra?


I need advice on correctly designing a table in Cassandra. I need to get a list of all the books, sorted by the date of the last update. Each time a particular book is purchased, the number_of_buyers column is updated, and I also need to update the value of the updated_at column. The problem is that updated_at is a clustering column, which is part of the primary key, and we can't update values in columns that are part of the primary key.

create table books (
   book_id uuid,
   created_at timestamp,
   updated_at timestamp,
   book_name varchar,
   book_author varchar,
   number_of_buyers int,
   primary key (book_id, updated_at)
) with clustering order by (updated_at desc);

Another example:

create table chat_rooms (
   chat_room_id uuid,
   created_at timestamp,
   updated_at timestamp,
   last_message_content varchar,
   last_message_author varchar,
   unread_messages_number int,
   primary key (chat_room_id, updated_at)
) with clustering order by (updated_at desc);

Each chat room stores its latest message. This information changes constantly, and whenever it changes I want to move the chat room to the top of the list: classic behavior in many messengers.
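To make the restriction concrete, here is roughly what happens if you try to update a clustering column directly (using the chat_rooms table above; the exact error text may vary by Cassandra version):

```cql
-- Attempting to change a clustering column fails, because updated_at
-- is part of the primary key and therefore immutable:
UPDATE chat_rooms
   SET updated_at = toTimestamp(now())
 WHERE chat_room_id = 123e4567-e89b-12d3-a456-426614174000;

-- InvalidRequest: PRIMARY KEY part updated_at found in SET part
```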

1 Answer

Aaron:

So, for sure, you are going to need to partition on something different. The trick is going to be finding the right balance between query flexibility (your obvious need here) and avoiding unbounded partition growth.

For the books table, is it possible to partition on something like category? You know, like horror, fantasy, graphic novel, non-fiction, instructional, etc.?

CREATE TABLE book_events (
   book_id uuid,
   created_at timestamp,
   updated_at timestamp,
   book_name varchar,
   book_author varchar,
   number_of_buyers int,
   category text,
   PRIMARY KEY (category, book_name, updated_at, book_id)
) WITH CLUSTERING ORDER BY (book_name ASC, updated_at DESC, book_id ASC);

For the PRIMARY KEY definition, we partition on category, and then cluster on book_name and updated_at, with book_id on the end (for uniqueness). Then, INSERT a new row for each sale event. On the query side (after inserting a few rows), use the MAX aggregate on updated_at together with a GROUP BY on book_name.
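For example, each sale is recorded as a fresh row rather than an update (the literal values here are made up for illustration):

```cql
-- Hypothetical sale event: insert a new row instead of updating
-- the old one, so updated_at can serve as a clustering column.
INSERT INTO book_events
    (category, book_name, updated_at, book_id,
     created_at, book_author, number_of_buyers)
VALUES
    ('Computers & Technology', 'Mastering Apache Cassandra 3.x',
     toTimestamp(now()), uuid(),
     '2018-10-31', 'Aaron Ploetz, Teja Malepati', 53);
```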

SELECT book_name, book_author, number_of_buyers, MAX(updated_at) FROM book_events
 WHERE category='Computers & Technology' GROUP BY book_name;

 book_name                       | book_author                                                | number_of_buyers | system.max(updated_at)
---------------------------------+------------------------------------------------------------+------------------+---------------------------------
  Mastering Apache Cassandra 3.x |                                Aaron Ploetz, Teja Malepati |               52 | 2020-10-05 14:29:33.134000+0000
 Seven NoSQL Databases in a Week | Aaron Ploetz, Devram Kandhare, Brian Wu, Sudarshan Kadambi |              163 | 2020-10-05 14:29:33.142000+0000

(2 rows)

The only other consideration is what to do with the obsolete sale rows. You could delete them as you go, depending on the write frequency, of course. The optimal solution would be to consider the cadence of sales and apply a TTL.
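As a sketch, the TTL can be applied at write time (the 30-day value here is an arbitrary example; pick one based on your actual sales cadence):

```cql
-- Expire each sale event automatically after 30 days (2592000 seconds),
-- so obsolete rows disappear without explicit DELETEs.
INSERT INTO book_events
    (category, book_name, updated_at, book_id, book_author, number_of_buyers)
VALUES
    ('Computers & Technology', 'Mastering Apache Cassandra 3.x',
     toTimestamp(now()), uuid(), 'Aaron Ploetz, Teja Malepati', 54)
USING TTL 2592000;
```

Note that expired rows still leave tombstones behind until compaction runs, so at very high write rates the compaction strategy may need tuning.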

This solution is definitely not complete as-is, but I hope it leads you in the proper direction.