I have the following column family defined in Cassandra:
CREATE TABLE metric (
  period int,
  rollup int,
  tenant text,
  path text,
  time bigint,
  data list<double>,
  PRIMARY KEY ((tenant, period, rollup, path), time)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
Does the size of the data list affect read performance in Cassandra? If yes, how can we measure it?
The issue is that querying Data-Set 1 (8640 rows, where each row's data list holds 90 elements) for a given path/period/rollup combination takes longer than querying Data-Set 2 (8640 rows, where each row's data list holds 10 elements).
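For reference, the query is shaped like this (the tenant/period/rollup/path values below are placeholders, not my real data):

SELECT time, data
FROM metric
WHERE tenant = 't1'
  AND period = 60
  AND rollup = 300
  AND path = 'servers.web1.cpu'
LIMIT 8640;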
Also, if I run a performance test with 10 users querying Data-Set 1 simultaneously, I start seeing Cassandra timeouts on the backend, and Cassandra spends a lot of time in garbage collection; the same test against Data-Set 2 shows neither problem.
So I am concluding that the number of elements in the data list is affecting performance.
Have you seen similar performance issues in your Cassandra stack?
I wouldn't think that 90 items in a collection would be a big deal, but in your case it apparently is. The problem is that when you query a collection column, Cassandra can't return just part of the collection; it has to read and return the entire column (the whole collection). That operation isn't free, though 90 doubles is still not much data.
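To make that concrete, here's a minimal sketch (placeholder values again): the data column always comes back whole, because CQL in the versions your schema implies has no syntax for slicing a list in a SELECT.

-- Returns the complete 90-element list; there is no way in CQL to ask
-- for only, say, the first 10 elements of data. Any trimming has to
-- happen client-side, after the full collection has been read and sent.
SELECT time, data
FROM metric
WHERE tenant = 't1'
  AND period = 60
  AND rollup = 300
  AND path = 'servers.web1.cpu'
LIMIT 1;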
One thing to try is to turn tracing on. That should give you some idea of what Cassandra is doing when you run your query, and it often leads you straight to the culprit.
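In cqlsh that looks like this (TRACING ON is a real cqlsh command; the query values are placeholders):

TRACING ON
SELECT time, data
FROM metric
WHERE tenant = 't1'
  AND period = 60
  AND rollup = 300
  AND path = 'servers.web1.cpu'
LIMIT 10;

With tracing enabled, each query prints a step-by-step trace (activity, timestamp, source node, elapsed microseconds), so you can see where the time goes: reading SSTables, merging memtables, or shipping the large collection cells back to the coordinator.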
Are you using any special JVM settings? How much RAM do you have on each node? GC that interrupts normal operations indicates (to me) that there might be an issue with your JVM heap settings. The DataStax doc on Tuning Java Resources gives guidelines for sizing your heap based on your node's RAM:
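From memory, the guidance in that doc is roughly:

System RAM         Recommended heap size
Less than 2GB      1/2 of system memory
2GB to 4GB         1GB
More than 4GB      1/4 of system memory, capped at 8GB

The heap is set in conf/cassandra-env.sh. Here's a sketch for a hypothetical 16GB node (the values are illustrative, not a prescription):

# conf/cassandra-env.sh
# Cap the heap per the sizing table above: 1/4 of 16GB, under the 8GB cap.
MAX_HEAP_SIZE="4G"
# Young generation size; DataStax suggests roughly 100MB per CPU core.
HEAP_NEWSIZE="400M"

If your heap is much larger than those guidelines, or left at a default that doesn't fit your RAM, long stop-the-world GC pauses under concurrent load are exactly the symptom you'd expect.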