Cassandra performance on distinct query

1.8k views Asked by At

In Cassandra , I read that I need to design my table schema such that minimum number of partitions are hit. I have designed the schema to meet this requirement. But I am in a scenario where I need to get all the partition keys alone. So I am planning to use

Select Distinct <partitionKeys> from table

I ran a distinct query using cqlsh for around 15k rows .It was quite fast.

Questions

  1. Will there be any performance issues if I use distinct ?
  2. How cassandra fetches partition keys alone ?
  3. I need to know the limitations on distinct query.
1

There are 1 answers

2
Aaron On BEST ANSWER

Will there be any performance issues if I use distinct? How cassandra fetches partition keys alone?

Basically, Cassandra just has to rip through the nodes and pull back the partition (row) keys for that table. Querying by these keys is how Cassandra was designed to work, so I am not surprised that this performed really well for you. The drawback, is that it will probably have to hit all or most of your nodes to complete the operation, so performance could be slow if you have a large number of nodes.

This is where the difference between CQL rows and rows in the underlying storage comes into play. If you look at your data with the cassandra-cli tool, you can see how partition keys are treated differently. Here is an example where crew members of a ship are stored in a table, by their ship.

aploetz@cqlsh:presentation> SELECT * FROm shipcrewregistry ;

 shipname | lastname  | firstname | citizenid                            | aliases
----------+-----------+-----------+--------------------------------------+--------------------------------------
 Serenity |      Book |    Derial | 48bc975a-c9f2-474d-8a29-247503445877 |                       {'CLASSIFIED'}
 Serenity |      Cobb |     Jayne | 2d643fb1-54fb-4c98-8d2d-a5bb9c6c8354 |                   {'Hero of Canton'}
 Serenity |      Frye |    Kaylee | d556cf44-348b-4ea3-8c19-ba9d4877818c |                                 null
 Serenity |     Inara |     Serra | a25b7e02-8099-401a-8c41-d9d2ea894b72 |                                 null
 Serenity |  Reynolds |   Malcolm | 169382b7-21b0-47bf-b1c8-19bc008a9060 |             {'Mal', 'Sgt. Reynolds'}
 Serenity |       Tam |     River | af68201f-4135-413e-959c-dd81ea651e52 |                                 null
 Serenity |       Tam |     Simon | aa090e1a-7792-4d7b-bba9-bac66f8c1f15 |                          {'Dr. Tam'}
 Serenity | Washburne |     Hoban | 73f591df-c0dc-44c4-b3f3-9c37453c9537 |                             {'Wash'}
 Serenity | Washburne |      Zoey | 46bc77ad-53ad-4402-b252-a0543005c583 | {'Corporal Alleyne', 'Zoey Alleyne'}

(9 rows)

But when I query within the cassandra-cli:

[default@presentation] list shipcrewregistry;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: Serenity
=> (name=Book:Derial:48bc975a-c9f2-474d-8a29-247503445877:, value=, timestamp=1424904853420170)
=> (name=Book:Derial:48bc975a-c9f2-474d-8a29-247503445877:aliases:434c4153534946494544, value=, timestamp=1424904853420170)
=> (name=Cobb:Jayne:2d643fb1-54fb-4c98-8d2d-a5bb9c6c8354:, value=, timestamp=1424904853492976)
=> (name=Cobb:Jayne:2d643fb1-54fb-4c98-8d2d-a5bb9c6c8354:aliases:4865726f206f662043616e746f6e, value=, timestamp=1424904853492976)
=> (name=Frye:Kaylee:d556cf44-348b-4ea3-8c19-ba9d4877818c:, value=, timestamp=1428442425610395)
=> (name=Inara:Serra:a25b7e02-8099-401a-8c41-d9d2ea894b72:, value=, timestamp=1428442425621555)
=> (name=Reynolds:Malcolm:169382b7-21b0-47bf-b1c8-19bc008a9060:, value=, timestamp=1424904853505461)
=> (name=Reynolds:Malcolm:169382b7-21b0-47bf-b1c8-19bc008a9060:aliases:4d616c, value=, timestamp=1424904853505461)
=> (name=Reynolds:Malcolm:169382b7-21b0-47bf-b1c8-19bc008a9060:aliases:5367742e205265796e6f6c6473, value=, timestamp=1424904853505461)
=> (name=Tam:River:af68201f-4135-413e-959c-dd81ea651e52:, value=, timestamp=1428442425575881)
=> (name=Tam:Simon:aa090e1a-7792-4d7b-bba9-bac66f8c1f15:, value=, timestamp=1424904853518092)
=> (name=Tam:Simon:aa090e1a-7792-4d7b-bba9-bac66f8c1f15:aliases:44722e2054616d, value=, timestamp=1424904853518092)
=> (name=Washburne:Hoban:73f591df-c0dc-44c4-b3f3-9c37453c9537:, value=, timestamp=1428442425587484)
=> (name=Washburne:Hoban:73f591df-c0dc-44c4-b3f3-9c37453c9537:aliases:57617368, value=, timestamp=1428442425587484)
=> (name=Washburne:Zoey:46bc77ad-53ad-4402-b252-a0543005c583:, value=, timestamp=1428442425596863)
=> (name=Washburne:Zoey:46bc77ad-53ad-4402-b252-a0543005c583:aliases:436f72706f72616c20416c6c65796e65, value=, timestamp=1428442425596863)
=> (name=Washburne:Zoey:46bc77ad-53ad-4402-b252-a0543005c583:aliases:5a6f657920416c6c65796e65, value=, timestamp=1428442425596863)

1 Row Returned.
Elapsed time: 86 msec(s).

This is intended to show how 9 CQL rows are actually only 1 row "under the hood."

I need to know the limitations on distinct query.

In CQL, DISTINCT will only work on your partition keys. I am not sure as to how many rows will negate its usefulness. 15000 CQL rows should be fine for it. But if you have millions of distinct partition keys (high cardinality) I would expect performance to drop off...especially with several nodes in your cluster.