Spark local rdd Write to local Cassandra DB

189 views Asked by At

I have a DSE cluster where every node in the cluster has both spark and Cassandra running.

When I load data from Cassandra to spark Rdd and do some action on the rdd, i know the data would be distributed into multi nodes. In my case, I want to write these rdds from every node to its local Cassandra dB table directly, is there anyway to do it.

If I do normal rdd collect, all data from spark nodes would be merged and go back to node with driver. I do not want this to happen as the data flow from nodes back to driver node may take Long time, I want the data been save to local node directly to avoid the data movement across the spark nodes.

2

There are 2 answers

2
davedamoon On

A word of warning: i only use Cassandra and Spark as separate open source projects, i do not have expertise with DSE.

I am afraid the data need to hit the network to replicate, even when every spark node talks to its local cassandra node.

Without replication and running a Spark job to make sure all data is hashed and preshuffled to the corresponding Cassandra node, it should be possible to use 127.0.0.1:9042 and avoid the network.

0
Alex Ott On

When Spark executor is reading data from Cassandra it's sending request to the "best node" that is selected based on the different factors:

  • When Spark is collocated with Cassandra, then Spark is trying to pull data from the same node
  • When Spark is on different node, then it's using token-aware routing, and read data from multiple nodes in parallel, as it's defined by the partition ranges.

When it's comes to the writing, and you have multiple executors, then each executor is opening multiple connections to each node, and writing the data using the token-aware routing, meaning that data is sent directly to one of the replicas. Also, Spark is trying to batch multiple rows that are belonging to the same partition into an UNLOGGED BATCH as it's more performant. Even if the Spark partition is colocated with the Cassandra partition, writing could involve an additional network overhead as SCC is writing using the consistency level TWO.

You can get colocated data if you re-partitioned the data to match Cassandra's partitioning), but such re-partition may induce Spark shuffle that could be much more heavyweight compared to the writing data from executor to another node.

P.S. You can find a lot of additional information about Spark Cassandra Connector in the Russell Spitzer's blog.