How to get the total number of rows in a Cassandra column family using Java?

I want to get the total number of rows in a column family. I know that nodetool cfstats gives an approximate row count, but how can I get it using the Java client?

There are 3 answers

Jim Wartnick

The only way I have been able to do this is to write code that essentially executes a "select * from <table>" and fetches a small number of rows at a time. The count is maintained by the Java code, not by Cassandra. Unfortunately, Cassandra's default read timeouts are short (5 seconds for single-partition reads and 10 seconds for range scans), so you have to keep your fetch size small enough that each fetch does not time out. If the table is huge, the count can take a while to complete, but it does work. Keep in mind that the row count may change while your query is running, so the result is itself an estimate. I have a modular piece of Java code if you're interested.
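A minimal sketch of that paging loop, using only the standard library so it can run without a cluster. PageFetcher and Page are hypothetical stand-ins for the driver's paging, not real driver types. With the DataStax Java driver you would typically get the same effect by setting a fetch size on the statement and iterating the result set, which pages transparently.

```java
/** Counts rows page by page, keeping the running total on the client. */
class PagedRowCounter {

    /** One page of results: how many rows it held and where the next page starts. */
    static final class Page {
        final int rowCount;
        final String nextCursor; // null when there are no more pages

        Page(int rowCount, String nextCursor) {
            this.rowCount = rowCount;
            this.nextCursor = nextCursor;
        }
    }

    /** Hypothetical abstraction over "run the query for one page". */
    interface PageFetcher {
        Page fetch(String cursor, int fetchSize);
    }

    /** Fetches pages until the cursor runs out and sums their row counts. */
    static long countAll(PageFetcher fetcher, int fetchSize) {
        if (fetchSize < 1) {
            throw new IllegalArgumentException("fetchSize must be >= 1");
        }
        long total = 0;
        String cursor = null; // null means "start from the beginning"
        do {
            Page page = fetcher.fetch(cursor, fetchSize);
            total += page.rowCount;
            cursor = page.nextCursor;
        } while (cursor != null);
        return total;
    }
}
```

A smaller fetchSize means more round trips but less work per request, which is the trade-off that keeps each fetch under the read timeout.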

Chris Lohfink

You can query the system.size_estimates table to get approximate partition counts and sizes per token range on a single host. From the size of your cluster and your RF, you can make a ballpark estimate from that. It really depends on how accurate you need it to be. For precise measurements I would recommend Spark, but if it's something you need to track at runtime, it might be worth maintaining a counter that you update with every change, so that you can read the count quickly.
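The ballpark arithmetic might look like the sketch below. It assumes you have already read the partitions_count values for your table from system.size_estimates on one node, that data is spread evenly across nodes, and that the node's estimates cover every range it replicates. If your version reports only primary ranges, drop the division by RF. The class and method names are hypothetical. Note that this estimates partitions, which equals rows only when each partition holds a single row.

```java
/** Scales one node's size_estimates figures up to a cluster-wide ballpark. */
class SizeEstimate {

    /**
     * @param partitionsCountPerRange partitions_count values read from one node
     * @param nodeCount               number of nodes in the cluster
     * @param replicationFactor       RF of the keyspace
     * @return rough number of distinct partitions in the table
     */
    static long estimateClusterPartitions(long[] partitionsCountPerRange,
                                          int nodeCount,
                                          int replicationFactor) {
        if (nodeCount < 1 || replicationFactor < 1) {
            throw new IllegalArgumentException("nodeCount and replicationFactor must be >= 1");
        }
        long local = 0;
        for (long count : partitionsCountPerRange) {
            local += count;
        }
        // Assumption: each node stores about RF/N of all replicas,
        // so distinct partitions are roughly local * N / RF.
        return local * nodeCount / replicationFactor;
    }
}
```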

Alex Ott

As Chris mentioned, you can get an approximate number of rows via JMX metrics, and a more precise count can be done with Spark. If you need to do it via the Java client, then you'll need to perform an operation similar to what Spark does: count the rows by token range. In this case you're issuing queries that are executed by individual hosts, without overloading the coordinator, as happens if you do a naive select * from table. The query looks like this (it's pseudo-code, not a real query!): SELECT columns FROM table WHERE token(pk) > token_range.begin AND token(pk) <= token_range.end. The trick is that you need to set the routing key explicitly, because the token-aware load balancing policy isn't able to extract it from that query automatically.

The full source code is too long to include here, but you can find it here.
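To show just the shape of the per-range queries, here is a small standard-library sketch. It assumes the Murmur3Partitioner (signed 64-bit tokens) and simply cuts the ring into equal slices. Real code would instead take the token ranges from the driver's cluster metadata so that each query lines up with a replica, and would set the routing key on each statement as described above. The class, method, table and column names are made up for illustration.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Splits the Murmur3 token ring into contiguous (start, end] slices. */
class TokenRangeCount {

    /** Returns {startExclusive, endInclusive} pairs that together cover the whole ring. */
    static List<long[]> splitRing(int splits) {
        if (splits < 1) {
            throw new IllegalArgumentException("splits must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        // Ring width is 2^64 - 1, which does not fit in a long, hence BigInteger.
        BigInteger width = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        BigInteger n = BigInteger.valueOf(splits);

        List<long[]> ranges = new ArrayList<>();
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= splits; i++) {
            // end = min + width * i / splits; the last slice ends exactly at Long.MAX_VALUE
            long end = min.add(width.multiply(BigInteger.valueOf(i)).divide(n)).longValueExact();
            ranges.add(new long[] {start, end});
            start = end;
        }
        return ranges;
    }

    /** Builds the CQL text for one slice; the statement would be executed per range. */
    static String rangeQuery(String table, String pk, long[] range) {
        return String.format(Locale.ROOT,
                "SELECT %s FROM %s WHERE token(%s) > %d AND token(%s) <= %d",
                pk, table, pk, range[0], pk, range[1]);
    }

    public static void main(String[] args) {
        // Hypothetical table "ks.tbl" with partition key "id".
        for (long[] range : splitRing(4)) {
            System.out.println(rangeQuery("ks.tbl", "id", range));
        }
    }
}
```

The client then runs each query with a modest fetch size and adds the per-range row counts together; since the ranges are independent, they can also be run in parallel.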