I have this table
create table comment_by_post
(
postId uuid,
userId uuid,
cmntId timeuuid,
cmntTxt text,
cmntBy text,
time bigint,
primary key ((postId, userId),cmntId)
)
here is internal data in this table
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e537a
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270721107000)
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e743434, timestamp=1434270721107000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92eec7a
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270483603000)
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434270483603000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270492468000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434270492468000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270500239000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270500239000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e237a
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270709667000)
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270709667000)
If i do primary key (postId, userId,cmntId)
then its like:
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264176613000)
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434264176613000)
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264192912000)
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434264192912000)
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a5d94c30-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264201331000)
Why it is like that and what is the benefit of both ?
Christopher already explained how the partitioning keys are concatenated together to generate the rowkey for storage, so I won't re-hash (no pun intended) that. But I will explain the advantages and disadvantages of these two approaches.
With this PRIMARY KEY, your data is partitioned by
postId
, and clustered byuserId
andcmntId
. What this means, is that all comments made on a post will be stored together on-disk bypostId
, and then sorted byuserId
andcmntId
(respectively).The advantage here, is that you have query flexibility. You can query all comments for a post, or all comments for a post by a specific user.
The disadvantage, is that you have a higher chance of unbounded row growth than your other solution. If your total columns per
postId
ever were to exceed 2 billion, you would max out how much data you could store perpostId
. But the odds of you storing that much comment data per post are low, so you should be ok.This solution helps negate the possibility of unbounded row growth, by storing comment data together by a concatenated rowkey of
postId
anduserId
(sorted bycmntId
. That's the advantage over your other solution.The disadvantage is that of losing query flexibility, as now you need to provide
postId
anduserId
with every query. This PRIMARY KEY definition simply will not support queries for comments with onlypostId
, as Cassandra CQL requires you to provide the entire partition key for a query.