For example, I have the following code:
public static void main(String[] args) {
    RestController restController = new RestController();
    SparkSession sparkSession = SparkSession
            .builder()
            .appName("test example")
            .getOrCreate();
    Dataset<Row> csvFileDF = sparkSession.read().csv("test_csv");
    // code in task //
    restController.sendFile();
    // __________//
    csvFileDF.write().parquet("test_parquet");
}
The method restController.sendFile() is not executed in the Spark context, as opposed to the read-CSV and write-Parquet operations. The jar is run with:

spark-submit --jar main.jar

Do I understand correctly that restController.sendFile() is executed on the driver?
In general, in Spark the work that takes place on your executors consists of the actions/transformations that you perform on distributed data (RDDs, DataFrames, Datasets). Everything else takes place on the driver, because those calculations are not distributed.
So in your case, it does indeed seem like restController.sendFile() only takes place on the driver, but I can't say for sure because I don't know what that method does.

Let's make a very simple example:
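The original snippet is not reproduced here; a minimal Java sketch of such an example might look like the following (the class name, the local[*] master, and the exact column names are assumptions for illustration, as is incrementing via withColumn rather than whatever the original code did):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class DriverVsExecutorExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("driver vs executor")
                .master("local[*]")  // local master, just for this demo
                .getOrCreate();

        // Distributed data: this transformation is planned by Spark and,
        // once an action triggers it, runs on the executors.
        Dataset<Row> df = spark.range(1, 4).toDF("value");
        Dataset<Row> df2 = df.withColumn("value", col("value").plus(1));
        df2.show();  // this action shows up as a job in the history server

        // Plain JVM data: this runs entirely inside the driver process
        // and never appears as a Spark job.
        List<Long> myList = List.of(1L, 2L, 3L);
        List<Long> myList2 = myList.stream()
                .map(x -> x + 1)
                .collect(Collectors.toList());
        System.out.println(myList2);

        spark.stop();
    }
}
```

The contrast is the point: both halves "increment by 1", but only the Dataset version goes through Spark's scheduler.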
Here, you see that we:

- create a df2 dataframe by incrementing the first column by 1
- create a myList2 list by doing the same thing

When looking at the Spark history server for that application, we see:

[screenshot of the Spark history server omitted]
Only the dataframe operation happened in our Spark context. The rest happened on the driver as a normal, non-distributed calculation.