I have a DSE cluster where every node runs both Spark and Cassandra.
When I load data from Cassandra into a Spark RDD and run an action on it, I know the data is distributed across multiple nodes. In my case, I want each node to write its RDD partitions directly into its local Cassandra table. Is there any way to do this?
If I do a normal rdd.collect, all the data from the Spark nodes is merged and sent back to the driver node. I want to avoid this, since moving the data from the worker nodes back to the driver can take a long time. Instead, I want each node to save its data to the local Cassandra instance directly, avoiding data movement across the Spark nodes.
A word of warning: I only use Cassandra and Spark as separate open source projects; I have no expertise with DSE.
I am afraid the data still needs to hit the network in order to replicate, even when every Spark node talks only to its local Cassandra node.
Without replication, and with a Spark job that first hashes and preshuffles the data so that every row ends up on the node that owns its token range, it should be possible to connect to 127.0.0.1:9042 and avoid the network.
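A minimal sketch of that hash-and-preshuffle idea, outside of Spark: with replication factor 1, each partition key maps to exactly one owning node, so rows can be bucketed by owner before writing. Note the token function below is a stand-in for illustration (real Cassandra uses the Murmur3Partitioner), and the node addresses and ring layout are hypothetical.

```python
import hashlib
from collections import defaultdict

RING_SIZE = 2**64

def token(partition_key: str) -> int:
    # Stand-in token function; Cassandra actually uses Murmur3.
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

# Three hypothetical nodes, each owning one contiguous slice of the token ring.
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def owner(partition_key: str) -> str:
    # Map the key's token onto the node whose range contains it.
    return nodes[token(partition_key) * len(nodes) // RING_SIZE]

rows = [("user:%d" % i, "payload") for i in range(100)]

# "Preshuffle": bucket every row under the node that owns its token. A Spark
# partition colocated with that node could then write via 127.0.0.1:9042
# without any row ever crossing the network.
buckets = defaultdict(list)
for key, value in rows:
    buckets[owner(key)].append((key, value))

for node, bucket in sorted(buckets.items()):
    print(node, len(bucket))
```

In Spark terms, this bucketing corresponds to a repartition keyed by token ownership before the write; with any replication factor above 1, though, the replica copies still have to cross the network no matter how the write is routed.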