I am reviewing code under development and need to avoid, or find an alternative to, adding columns to a DataFrame with the 'withColumn' function, but I have the following doubts:

  1. Does chaining 'withColumn' calls (as in the code below) create new tables? That is, do 6 chained 'withColumn' calls create 6 new in-memory tables?
val newDataframe = table
  .withColumn("name", col("consolidate").cast(DecimalType(17, 2)))
  .withColumn("id_value", col("east").cast(DecimalType(17, 2)))
  2. If using many 'withColumn' calls increases memory usage and lowers performance (assuming that is true), how can I avoid 'withColumn' when adding columns to a DataFrame and still get the same result?

  3. Is there a way that uses less memory and runs faster without 'withColumn' but produces the same result, i.e. a DataFrame with 6 added columns?

I don't know how to do this.

The code to optimize is like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

def myMethod(table: DataFrame): DataFrame = {
    val newDataframe = table
      .withColumn("name", col("consolidate").cast(DecimalType(17, 2)))
      .withColumn("id_value", col("east").cast(DecimalType(17, 2)))
      .withColumn("x_value", col("daily").cast(DecimalType(17, 2)))
      .withColumn("amount", col("paid").cast(DecimalType(17, 2)))
      .withColumn("client", col("lima").cast(DecimalType(17, 2)))
      .withColumn("capital", col("econo").cast(DecimalType(17, 2)))
    newDataframe
  }

1 Answer

Fabich

There is a misconception here: Spark does not create the 6 intermediate datasets in memory. In fact, your function will not trigger any work at all, because Spark transformations (such as withColumn) are evaluated lazily and only run when an action is called (such as .count() or .show()).
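A quick way to see this (a minimal sketch, assuming a SparkSession is already running and `table` is a DataFrame with the columns used above):

    // Nothing is executed here: the chained withColumn calls only build a logical plan.
    val planned = myMethod(table)

    // Printing the extended plan shows that Catalyst typically collapses the
    // successive projections into a single Project over the source relation.
    planned.explain(true)

    // Only an action such as count() or show() actually triggers computation.
    planned.count()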

When an action is called, Spark optimizes your transformations and executes them all at once, so calling .withColumn 6 times is not a problem in terms of memory.
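If you still prefer to avoid the chain of withColumn calls (for example, to keep the logical plan small when adding many columns), the same result can be expressed as a single select. This is a sketch using the same column names as your code, not a required change:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DecimalType

    def myMethod(table: DataFrame): DataFrame =
      table.select(
        col("*"),                                          // keep all existing columns
        col("consolidate").cast(DecimalType(17, 2)).as("name"),
        col("east").cast(DecimalType(17, 2)).as("id_value"),
        col("daily").cast(DecimalType(17, 2)).as("x_value"),
        col("paid").cast(DecimalType(17, 2)).as("amount"),
        col("lima").cast(DecimalType(17, 2)).as("client"),
        col("econo").cast(DecimalType(17, 2)).as("capital")
      )

On Spark 3.3 and later there is also a withColumns(Map[String, Column]) variant that adds several columns in one call. Functionally, all of these produce the same DataFrame.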