I am trying to run horovod.torch on a GPU cluster (p2.xlarge instances) on Databricks.
Because Horovod uses AllReduce to communicate between nodes, each worker node loads the whole dataset and works on a different partition of it. After each iteration, the nodes exchange and average their gradients via AllReduce, so each node updates its own copy of the parameters with the same averaged values.
My understanding is that this is SPMD (single program, multiple data), because each worker node runs the same program and loads the same whole dataset.
So I need to load the whole dataset on each worker node, right?
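For reference, this is the training-side pattern I have in mind (just a rough sketch: the dataset here is random in-memory toy data and the model/optimizer are placeholders, only meant to show the rank-based partitioning and gradient averaging):

import torch
import horovod.torch as hvd
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

hvd.init()

# Every worker materializes the same (toy) dataset, but the sampler
# restricts each worker to its own shard based on its Horovod rank.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# DistributedOptimizer averages gradients across workers with AllReduce.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# Start all workers from the same initial weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)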
My code:
import horovod.torch as hvd
from sparkdl import HorovodRunner

def test1():
    hvd.init()
    train_df = spark.read.parquet("s3://my_data/").cache()
    print("load data done")

hr = HorovodRunner(np=2)
hr.run(test1)
But I got this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
It seems that Spark does not allow the driver's SparkContext/SparkSession to be referenced from code that runs on the workers?
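I believe this is the same restriction you hit with any closure shipped to executors that captures the driver's spark object, independent of Horovod. A minimal sketch that should raise the same SPARK-5063 error:

# Capturing the driver's `spark` inside a function sent to executors
# triggers the SparkContext serialization check (SPARK-5063).
rdd = spark.sparkContext.parallelize(range(4))
rdd.map(lambda x: spark.read.parquet("s3://my_data/").count()).collect()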
I also tried to create a new local SparkSession on each worker:
def test1():
    hvd.init()
    from pyspark.sql import SparkSession
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    train_df = spark.read.parquet("s3://my_data/").cache()

hr = HorovodRunner(np=2)
hr.run(test1)
I got this error:
[1,1]<stderr>:Error: Could not find or load main class org.apache.spark.launcher.Main
[1,1]<stderr>:/databricks/spark/bin/spark-class: line 101: CMD: bad array subscript
How can I use Spark to load data on each worker node?
If Spark does not allow worker nodes to create their own SparkSession, how should I load data on each worker node for Horovod?
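One workaround I am considering (not sure if it is the intended pattern) is to skip Spark inside the worker function entirely and read the Parquet files directly, for example with pyarrow. A rough sketch, assuming the workers have S3 credentials configured and pyarrow can resolve the s3:// path:

import pyarrow.parquet as pq
import horovod.torch as hvd
from sparkdl import HorovodRunner

def test1():
    hvd.init()
    # Hypothetical alternative: read the Parquet data directly on the worker
    # instead of going through a SparkSession.
    table = pq.read_table("s3://my_data/")
    pdf = table.to_pandas()
    print("load data done", len(pdf))

hr = HorovodRunner(np=2)
hr.run(test1)

Is this a reasonable approach, or is there a recommended way to feed Spark data to Horovod workers?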