I need to check for nulls in DataFrame columns and update a DataFrame column in an efficient way.
For example, the DataFrame is:
I do a null check on every column (Column1 to Column20) and update error_notes as shown below:
val df1 = data.withColumn("error_notes", when(col("Column1").isNull, concat_ws("!", lit("Column1 is null"), col("error_notes"))).otherwise(col("error_notes")))
val df2 = df1.withColumn("error_notes", when(col("Column2").isNull, concat_ws("!", lit("Column2 is null"), col("error_notes"))).otherwise(col("error_notes")))
val df3 = df2.withColumn("error_notes", when(col("Column3").isNull, concat_ws("!", lit("Column3 is null"), col("error_notes"))).otherwise(col("error_notes")))
.
.
.
.
val df20 = df19.withColumn("error_notes", when(col("Column20").isNull, concat_ws("!", lit("Column20 is null"), col("error_notes"))).otherwise(col("error_notes")))
Running the null check for all the columns and updating the error_notes column takes a long time (almost 4 hours) to finish because the DataFrame is huge. Is there a more efficient and performant way to do this?
The checks for all columns can be prepared up front, so that withColumn is called only once.
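A minimal sketch of the single-withColumn idea, not the original answer's code. It assumes a local SparkSession, a hypothetical three-column sample in place of Column1 to Column20, and that the "!" separator and "ColumnN is null" messages from the question are wanted. It relies on two behaviours of the Spark SQL functions: when without otherwise yields null, and concat_ws skips null inputs.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

object NullCheckSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-check-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: three checked columns instead of twenty
    val data = Seq[(String, String, String, String)](
      ("a", null, "c", "existing note"),
      (null, null, "c", null),
      ("a", "b", "c", "existing note")
    ).toDF("Column1", "Column2", "Column3", "error_notes")

    // Names of the columns to check; extend this list to Column20 as needed
    val checkedColumns = Seq("Column1", "Column2", "Column3")

    // One expression per column: the message if the column is null, otherwise null
    val nullMessages: Seq[Column] =
      checkedColumns.map(c => when(col(c).isNull, lit(s"$c is null")))

    // concat_ws drops null inputs, so only messages for null columns survive;
    // the existing error_notes value is appended last, as in the question
    val result = data.withColumn(
      "error_notes",
      concat_ws("!", (nullMessages :+ col("error_notes")): _*)
    )

    result.show(false)
    spark.stop()
  }
}
```

Two differences from the chained version in the question are worth noting. Messages here appear in column order (Column1 first), whereas the chained version prepends each new message; apply reverse to nullMessages if that order matters. Also, concat_ws returns an empty string rather than null when every input is null, so rows with no null columns and no prior note get "" in error_notes.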