I have two nodes and I created a keyspace like this:
DESCRIBE uzzstore
CREATE KEYSPACE uzzstore WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
CREATE TABLE uzzstore.chunks (
id blob PRIMARY KEY,
size bigint
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
Currently only node 1 receives queries (reads/writes), and from what I understand from the documentation, all writes should be replicated; therefore I assume both nodes will end up with the same data. I added the second node later, and I have flushed and repaired the nodes multiple times. However, I see node 1 has about 213,435,988 rows and node 2 only 206,916,617 rows.
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.51 17.5 GB 256 ? 15450683-e34b-475d-a393-ad25611398d8 rack1
UN 192.168.1.100 17.92 GB 256 ? 6cad2ba2-b22e-4947-a952-dc65c616a08f rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
Is this expected behaviour? Is my understanding of replicas incorrect? (Note that I gave the cluster plenty of time to catch up.)
You're right: if you have a two-node cluster and a keyspace with replication_factor 2, then indeed every piece of data will be on both nodes, and every write will be "eventually" replicated to both. If you use CL=ALL you can be sure this has happened by the time the write completes, but even if you use CL=ONE the write will still happen eventually on the second node, usually very quickly. After a repair (which you said you did) you can be sure the same data appears on both nodes, and both nodes should hold exactly the same number of rows.
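As an aside, if you want every write to be acknowledged by both replicas before it returns, you can set the consistency level in cqlsh before writing. A minimal sketch against the uzzstore.chunks table from your question (the id and size values are just illustrative):

-- Require both replicas (RF=2) to acknowledge each write before it is reported successful
CONSISTENCY ALL;
INSERT INTO uzzstore.chunks (id, size) VALUES (0xdeadbeef, 1024);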
Yet you said "I see node 1 has about 213,435,988 rows and node 2 only 206,916,617 rows." How sure are you about these numbers, and how did you come by them? Did you really scan the table (and if so, how did you limit the scan to just one node?), or did you use some sort of "size estimate" feature? If it's the latter, you should be aware that on both Cassandra and Scylla this is just an estimate. It turns out that this estimate is even less accurate and trustworthy in ScyllaDB than in Cassandra (see https://github.com/scylladb/scylladb/issues/9083), but in both of them, whether or not you ran a major compaction (nodetool compact) affects the estimate. You said that you "flushed and repaired" the tables, but not that you compacted them.
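For example, you could run a major compaction of just that table on each node and then look at the per-node partition-count estimate. This is only a sketch, assuming a reasonably recent nodetool (in older versions tablestats is called cfstats):

nodetool compact uzzstore chunks        # major compaction of this one table
nodetool tablestats uzzstore.chunks     # per-node statistics, including the partition-count estimate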
In any case, I want to emphasize again that even though compaction affects the estimate of the number of partitions, it has no effect on the correctness of the data or on the exact number of rows you will see if you scan the entire table with SELECT * FROM table or count them with SELECT COUNT(*) FROM table. A repair might be needed if hinted handoff wasn't enabled and your cluster had connectivity problems during the writes, but since you did say you repaired, you should be good.
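If you do want an exact count rather than an estimate, you can run the count at CL=ALL so that both replicas take part in the read. Keep in mind that counting ~200 million rows is a full table scan, so you will likely need a much larger client timeout than the default; the timeout below (in seconds) is only an illustrative value:

$ cqlsh 192.168.1.51 --request-timeout=3600   # raise the client-side timeout for the long scan
cqlsh> CONSISTENCY ALL;
cqlsh> SELECT COUNT(*) FROM uzzstore.chunks;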