In Cassandra, I read that I need to design my table schema so that queries hit the minimum number of partitions. I have designed the schema to meet this requirement, but I am now in a scenario where I need to retrieve only the partition keys. So I am planning to use
SELECT DISTINCT <partitionKeys> FROM table;
I ran a DISTINCT query using cqlsh against around 15k rows. It was quite fast.
Questions
- Will there be any performance issues if I use DISTINCT?
- How does Cassandra fetch only the partition keys?
- What are the limitations of a DISTINCT query?
Basically, Cassandra just has to rip through the nodes and pull back the partition (row) keys for that table. Querying by these keys is how Cassandra was designed to work, so I am not surprised that this performed really well for you. The drawback is that it will probably have to hit all or most of your nodes to complete the operation, so the query could be slow if you have a large number of nodes.
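To make the "hit all or most of your nodes" point concrete, here is a rough sketch in plain Python. It is only a toy model, not how Cassandra is implemented: the node count, the `owning_node` helper, and the MD5-modulo placement are all assumptions standing in for the real partitioner and token ring.

```python
import hashlib
from collections import defaultdict

NODE_COUNT = 4  # hypothetical cluster size


def owning_node(partition_key, node_count=NODE_COUNT):
    """Map a partition key to a node (a stand-in for token-ring placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def build_cluster(rows, node_count=NODE_COUNT):
    """Store each (partition_key, clustering_key) row on the node that owns its partition."""
    nodes = [defaultdict(list) for _ in range(node_count)]
    for partition_key, clustering_key in rows:
        node = nodes[owning_node(partition_key, node_count)]
        node[partition_key].append(clustering_key)
    return nodes


def select_distinct_partition_keys(nodes):
    """Gather partition keys from every node; returns (sorted keys, nodes contacted)."""
    keys = []
    contacted = 0
    for node in nodes:
        contacted += 1  # any node may own some keys, so each one has to be asked
        keys.extend(node.keys())  # one entry per partition, however many rows it holds
    return sorted(keys), contacted


# 15,000 made-up CQL rows spread over 50 made-up partition keys.
rows = [(f"ship-{i % 50}", f"crew-{i}") for i in range(15000)]
cluster = build_cluster(rows)
keys, contacted = select_distinct_partition_keys(cluster)
print(len(keys), contacted)  # 50 4
```

In this model the work per node depends on how many partitions it holds, not on how many CQL rows sit inside them, while the number of nodes contacted grows with the cluster.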
This is where the difference between CQL rows and rows in the underlying storage comes into play. If you look at your data with the cassandra-cli tool, you can see how partition keys are treated differently. Here is an example where crew members of a ship are stored in a table, by their ship.
But when I query within the cassandra-cli:
This is intended to show how 9 CQL rows are actually only 1 row "under the hood."
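As an illustration of that idea only, the sketch below models it in plain Python. The ship name, crew entries, and the `to_storage_rows` helper are made up for this example, and the cell layout is a simplification of the older storage format, not actual cassandra-cli output.

```python
from collections import OrderedDict

# Hypothetical CQL rows: (ship_name, crew_id, crew_name).
# ship_name plays the partition key, crew_id the clustering key.
cql_rows = [("Aurora", i, f"crew member {i}") for i in range(1, 10)]


def to_storage_rows(rows):
    """Group CQL rows into one wide storage row per partition key."""
    storage = OrderedDict()
    for ship_name, crew_id, crew_name in rows:
        cells = storage.setdefault(ship_name, OrderedDict())
        # Each cell name combines the clustering key value with the column name.
        cells[(crew_id, "crew_name")] = crew_name
    return storage


storage_rows = to_storage_rows(cql_rows)
print(len(cql_rows), len(storage_rows))  # 9 1
print(list(storage_rows))  # ['Aurora']
```

Listing the keys of `storage_rows` is roughly what a DISTINCT on the partition key asks for: one entry per storage row, regardless of how many CQL rows it contains.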
In CQL, DISTINCT will only work on your partition keys. I am not sure how many rows it would take to negate its usefulness. 15,000 CQL rows should be fine for it. But if you have millions of distinct partition keys (high cardinality), I would expect performance to drop off, especially with several nodes in your cluster.