Where does a Spark application execute code that does not use the Spark context?

Asked by Jelly on 25 October 2023 at 10:02

For example, I have the following code:

public static void main(String[] args) {
    RestController restController = new RestController();
    SparkSession sparkSession = SparkSession
            .builder()
            .appName("test example")
            .getOrCreate();

    Dataset<Row> csvFileDF = sparkSession.read().csv("test_csv");
    
    // code in task //
    restController.sendFile();
    // __________//
    
    csvFileDF.write().parquet("test_parquet");
}

The method restController.sendFile() does not run in the Spark context, unlike the CSV read and Parquet write operations.

The jar is run with:

spark-submit main.jar

Do I understand correctly that restController.sendFile() is executed on the driver?

**Koedlt** · Accepted Answer · 2023-10-26T07:39:31+00:00

In general, the computations that run on your Spark executors are the transformations and actions that you perform on distributed data (RDDs, DataFrames, Datasets). Everything else runs on the driver, because that code is not distributed.

So in your case, it does indeed seem like restController.sendFile() runs only on the driver, but I can't say for sure because I don't know what that method does.
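One way to check where a given call runs is to have it report its host name and process ID, once from plain driver code and once from inside a distributed operation. The sketch below is only an illustration and not part of the original answer: `where_am_i` and `compare_driver_and_executors` are hypothetical helper names, and it assumes an existing SparkSession is passed in.

```python
import os
import socket


def where_am_i(label):
    """Return a string naming the host and process that evaluated this call."""
    return f"{label}: host={socket.gethostname()} pid={os.getpid()}"


def compare_driver_and_executors(spark, num_partitions=4):
    """Report where plain code and distributed code are evaluated.

    `spark` is assumed to be an existing SparkSession (hypothetical usage:
    compare_driver_and_executors(SparkSession.builder.getOrCreate())).
    """
    # Plain Python call: evaluated in the driver process.
    driver_info = where_am_i("driver")

    # The lambda is serialized and shipped to the executors, where it is
    # evaluated once per element (here: one element per partition).
    rdd = spark.sparkContext.parallelize(range(num_partitions), num_partitions)
    task_info = rdd.map(lambda i: where_am_i(f"task {i}")).collect()

    return driver_info, task_info
```

On a cluster, the task strings should name the executor hosts, while the driver string names the machine running the driver. In local mode the host is the same, but the tasks typically report process IDs different from the driver's, since PySpark evaluates them in separate Python worker processes.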

Let's make a very simple example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

myList = [
    (1,),
    (2,),
    (3,),
    (4,),
    (5,),
    (6,),
    (7,),
    (8,),
    (9,),
    (10,),
]
df = spark.createDataFrame(
    myList,
    ["myInt"],
)

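# Distributed: Spark evaluates this column expression on the executors.
df2 = df.withColumn("increment", F.col("myInt") + 1)
df2.write.csv("myTestFile.csv")

# Not distributed: a plain Python list comprehension, evaluated on the driver.
myList2 = [(x[0], x[0] + 1) for x in myList]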

Here, you can see that we:

- create a df2 DataFrame by incrementing the first column by 1
- create a myList2 list by doing the same thing

When looking at the Spark history server for that application, we see that only the DataFrame operation happened in our Spark context. The rest happened on the driver as a normal, non-distributed calculation.
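To make the contrast concrete, the same function can be run in either place depending on how it is invoked: called directly it runs on the driver, and passed into a distributed operation it runs on the executors. The following is a rough sketch under assumptions, not the asker's actual code: `send_partition`, `send_on_driver`, `send_on_executors`, and the `send` callable are hypothetical names standing in for something like sendFile().

```python
def send_partition(rows, send):
    """Pass every row of one partition to `send`; return how many were sent."""
    count = 0
    for row in rows:
        send(row)
        count += 1
    return count


def send_on_driver(df, send):
    """Run `send` in the driver process.

    collect() pulls all rows back to the driver, so the loop below is
    ordinary, non-distributed Python.
    """
    return send_partition(df.collect(), send)


def send_on_executors(df, send):
    """Run `send` inside Spark tasks.

    foreachPartition ships the function to the executors, so `send` must be
    serializable, and its side effects happen on the executor machines,
    not on the driver.
    """
    df.foreachPartition(lambda rows: send_partition(rows, send))
```

With the DataFrame from the example, `send_on_driver(df2, print)` would print on the driver's console, whereas the output of `send_on_executors(df2, print)` would be expected in the executor logs instead.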
