Cassandra spark connector data loss


Recently we were calculating some statistics using the DataStax Spark connector. Repeated queries were returning different results on each execution.

Background: we have approx. 112K records in a 3-node Cassandra cluster. The table has a single partition key, a UUID column named guid, and no clustering columns.

This is the simple guid extractor I defined to examine the losses:

import com.datastax.spark.connector._, java.util.UUID   // enables sc.cassandraTable

val guids = sc.cassandraTable[UUID]("keyspace", "contracts").select("guid")

Next, I repeatedly collected the data into local collections several times:

// collect the guid RDD 20 times, each run into its own Set
val gss = List.fill(20)(Set(guids.collect(): _*))
// union of all guids seen across the 20 runs
val gsall = gss.reduce(_ | _)
// per run: how many guids from the union are missing
val lost = gss.map(gs => (gsall &~ gs).size)

The resulting lost is List(5970, 7067, 6926, 6683, 5807, 7901, 7005, 6420, 6911, 6876, 7038, 7914, 6562, 6576, 6937, 7116, 7374, 6836, 7272, 7312),

so we see a 6.17 ± 0.47% data loss on each query.
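For reference, a minimal sketch of how that figure can be reproduced from the per-run loss counts above (the total of ~112,000 records is an assumption taken from the background section; the poster may have divided by the exact union size, hence the small difference in the mean):

// hypothetical standalone check: mean and sample standard deviation of the per-run losses
val lostCounts = List(5970, 7067, 6926, 6683, 5807, 7901, 7005, 6420, 6911, 6876,
                      7038, 7914, 6562, 6576, 6937, 7116, 7374, 6836, 7272, 7312)
val total = 112000.0                                   // approximate table size, assumed
val mean  = lostCounts.sum / lostCounts.size.toDouble
val std   = math.sqrt(lostCounts.map(x => math.pow(x - mean, 2)).sum / (lostCounts.size - 1))
println(f"loss: ${mean / total * 100}%.2f%% ± ${std / total * 100}%.2f%%")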

Could this be a problem with Cassandra, Spark, or the connector? And in either case, is there some configuration option to prevent it?


1 Answer

Best answer, by Odomontois:

I read some docs and learned that the read consistency level can and should be set for such situations. After declaring

import com.datastax.spark.connector.rdd.ReadConf, com.datastax.driver.core.ConsistencyLevel

implicit val readConf = ReadConf.fromSparkConf(sc.getConf).copy(
    consistencyLevel = ConsistencyLevel.ALL)

I got stable results.
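If the implicit is inconvenient, the same setting can also be applied explicitly. This is a sketch only, assuming your connector version exposes withReadConf on the RDD and the spark.cassandra.input.consistency.level property:

import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf
import com.datastax.driver.core.ConsistencyLevel
import java.util.UUID

// 1) per-RDD: pass the ReadConf explicitly instead of relying on the implicit
val guidsAll = sc.cassandraTable[UUID]("keyspace", "contracts")
  .select("guid")
  .withReadConf(ReadConf.fromSparkConf(sc.getConf).copy(
    consistencyLevel = ConsistencyLevel.ALL))

// 2) cluster-wide: set it once on the SparkConf before the context is created
// sparkConf.set("spark.cassandra.input.consistency.level", "ALL")

Note that ConsistencyLevel.ALL trades availability for consistency: reads will fail if any replica for a partition is down, so a lower level such as QUORUM may be a more practical choice depending on the replication factor.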