Is a CQL COUNT() on a single partition also an expensive operation?

165 views Asked by At

I know Cassandra count() is an expensive operation as it needs a complete table scan. https://www.datastax.com/blog/running-count-expensive-cassandra

But let's say, we have a table hotel with hotel_type as the partition key and we run query

select count(*) from hotel where hotel_type= 'luxury';

Will this be expensive too? Actually, I need to run 1 million queries like this to get the count of different hotel_types. Will finding count can impact the prod Cassandra cluster?

Update:

I saw we can use dsbulk for the counting. How dsbulk count is different from cql count()?

2

There are 2 answers

2
Erick Ramirez On BEST ANSWER

You are correct that an unbounded COUNT() is an expensive operation in Cassandra for the reasons I've explained in this post -- https://dba.stackexchange.com/questions/314567/.

But counting the rows of a single partition (i.e. SELECT query is restricted with WHERE pk = ?) is a valid use case for the CQL COUNT() function.

Additionally, it should be fine executing a million read operations asynchronously provided you throttle the requests to a throughput your cluster can handle.

The count command in the DataStax Bulk Loader (DSBulk) tool is one the recommended ways for running a full table count because DSBulk optimises the operation by breaking it up into small token segments then fires off multiple distributed requests instead of a single full table scan.

If you're interested, here are some references with examples for using DSBulk:

Cheers!

3
Hakan On

The main concept of Cassandra modelization is the Partition Key, who determine where is the data physically on the Cluster

When the Partition key is used with an a count() function, the coordinator node (who handle the request) will retrieve all associated data locally and then apply the aggregation operation locally

As this request filter by Partition Key and not need to scan all data of the table accros the Cluster, you can use it

However, you need to keep in mind that will induce more ressource consumption on the the coordinator nodes and a bench will be helpful before going to production