Recently we were calculating some statistics using the DataStax Spark Cassandra Connector, and repeated queries were returning different results on each execution.
Background: we have approx. 112K records in a 3-node Cassandra cluster. The table has a single partition key UUID column named guid
and no clustering key columns.
This is the simple guid extractor I defined to examine the losses:

// full-table scan projecting only the partition key column
val guids = sc.cassandraTable[UUID]("keyspace", "contracts").select("guid")
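For completeness, this assumes a Spark context already wired to the cluster; a minimal sketch of that setup (the app name and contact host below are made up, and the connector import is what brings cassandraTable into scope):

import java.util.UUID
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

// Hypothetical connection settings; replace the host with a real node address
val conf = new SparkConf()
  .setAppName("guid-loss-check")
  .set("spark.cassandra.connection.host", "192.168.1.10")
val sc = new SparkContext(conf)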
Next I repeatedly extracted the data to local collections:

// 20 independent full-table reads, each materialized as a Set on the driver
val gss = List.fill(20)(Set(guids.collect(): _*))
// union of all guids seen across the 20 runs
val gsall = gss reduce (_ | _)
// per-run number of guids missing relative to that union
val lost = gss map (gs => (gsall &~ gs).size)
The resulting lost counts are
List(5970, 7067, 6926, 6683, 5807, 7901, 7005, 6420, 6911, 6876, 7038, 7914, 6562, 6576, 6937, 7116, 7374, 6836, 7272, 7312)
so we see about 6.17 ± 0.47% data loss on each query.
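(Those figures can be reproduced approximately from the list above; a quick sketch, assuming the total is taken as roughly 112,000 rows:)

// Rough reproduction of the 6.17 ± 0.47 % figure; 112000.0 is the approximate row count
val total = 112000.0
val pct = lost.map(_ / total * 100)
val mean = pct.sum / pct.size  // ≈ 6.2
val stdDev = math.sqrt(pct.map(p => math.pow(p - mean, 2)).sum / (pct.size - 1))  // ≈ 0.47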
Could this be a problem with Cassandra, Spark, or the connector? And in each case, is there some configuration option to prevent it?
I've read some docs and learned that the read consistency level could and should be set for such situations.
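A minimal sketch of such a declaration, assuming the spark-cassandra-connector's ReadConf API (the field names, imports and the chosen level depend on the connector version; the default read consistency I've seen, LOCAL_ONE, may read each partition from a single, possibly out-of-sync replica):

import com.datastax.driver.core.ConsistencyLevel
import com.datastax.spark.connector.rdd.ReadConf

// Illustrative only: require a quorum of replicas to acknowledge each read
val guidsQuorum = sc.cassandraTable[UUID]("keyspace", "contracts")
  .select("guid")
  .withReadConf(ReadConf(consistencyLevel = ConsistencyLevel.QUORUM))

// Alternatively, the level can be set via Spark configuration:
// spark.cassandra.input.consistency.level = QUORUM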
After declaring the read consistency level, I got my stable result.