I want to create a new boolean column in my dataframe that derives its value from the evaluation of two conditional statements on other columns in the same dataframe:
columns = ["id", "color_one", "color_two"]
data = spark.createDataFrame([(1, "blue", "red"), (2, "red", None)]).toDF(*columns)
data = data.withColumn('is_red', data.color_one.contains("red") | data.color_two.contains("red"))
This works fine unless either color_one or color_two is NULL in a row. In cases like these, is_red is also set to NULL for that row instead of true or false:
+-------+----------+------------+-------+
|id |color_one |color_two |is_red |
+-------+----------+------------+-------+
| 1| blue| red| true|
| 2| red| NULL| NULL|
+-------+----------+------------+-------+
This means that PySpark is evaluating all of the clauses of the conditional statement rather than exiting early (via short-circuit evaluation) if the first condition happens to be true (like in row 2 of my example above).
Does PySpark support the short-circuit evaluation of conditional statements?
In the meantime, here is a workaround I have come up with to null-check each column:
from pyspark.sql import functions as F
color_one_is_null = data.color_one.isNull()
color_two_is_null = data.color_two.isNull()
data = data.withColumn('is_red', F.when(color_two_is_null, data.color_one.contains("red"))
.otherwise(F.when(color_one_is_null, data.color_two.contains("red"))
.otherwise(F.when(color_one_is_null & color_two_is_null, F.lit(False))
.otherwise(data.color_one.contains("red") | data.color_two.contains("red"))))
)
I don't think Spark support short-circuit evaluation on conditionals as stated here https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#:~:text=Spark%20SQL%20(including,short-circuiting%E2%80%9D%20semantics.:
Another alternative way would be creating an array of column_one and column_two, then evaluate if the array contains 'red' using SQL
EXISTS