Understanding internal data storing by cassandra

1k views Asked by At

I have this table

create table comment_by_post
(
    postId uuid,
    userId uuid,
    cmntId timeuuid,
    cmntTxt text,   
    cmntBy text,
    time bigint, 
    primary key ((postId, userId),cmntId)
)

here is internal data in this table

RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e537a
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270721107000)
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e743434, timestamp=1434270721107000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92eec7a
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270483603000)
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434270483603000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270492468000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434270492468000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270500239000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270500239000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e237a
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270709667000)
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270709667000)

If i do primary key (postId, userId,cmntId) then its like:

RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264176613000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434264176613000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264192912000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434264192912000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a5d94c30-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264201331000)

Why it is like that and what is the benefit of both ?

2

There are 2 answers

2
Aaron On BEST ANSWER

Christopher already explained how the partitioning keys are concatenated together to generate the rowkey for storage, so I won't re-hash (no pun intended) that. But I will explain the advantages and disadvantages of these two approaches.

PRIMARY KEY (postId, userId,cmntId)

With this PRIMARY KEY, your data is partitioned by postId, and clustered by userId and cmntId. What this means, is that all comments made on a post will be stored together on-disk by postId, and then sorted by userId and cmntId (respectively).

The advantage here, is that you have query flexibility. You can query all comments for a post, or all comments for a post by a specific user.

The disadvantage, is that you have a higher chance of unbounded row growth than your other solution. If your total columns per postId ever were to exceed 2 billion, you would max out how much data you could store per postId. But the odds of you storing that much comment data per post are low, so you should be ok.

PRIMARY KEY ((postId, userId),cmntId)

This solution helps negate the possibility of unbounded row growth, by storing comment data together by a concatenated rowkey of postId and userId (sorted by cmntId. That's the advantage over your other solution.

The disadvantage is that of losing query flexibility, as now you need to provide postId and userId with every query. This PRIMARY KEY definition simply will not support queries for comments with only postId, as Cassandra CQL requires you to provide the entire partition key for a query.

1
Christopher Bradford On

The first primary key uses postId and userId as partition keys with cmntId as a clustering column. Note the value used for the RowKey contains both the values from postId and userId separated by a :. Next the value for the clustering column is used in the name of each cell within the row.

In your second example the primary key is missing parenthesis around the partition key. They may be omitted, but are often preferred to be present as we can explicitly determine which parts of the primary key are for the partition and clustering. When the extra parenthesis are not included only the first column is used as the partition key (visible in the RowKey value from cassandra-cli). All subsequent columns are assumed to be clustering columns which we can verify by looking at the cell names.