My question is about how empty values are handled/identified in Spark 3 compared to Spark 2. The context is reading data from a Hive database (nothing has changed on the database side), not reading from CSV, JSON, or other file sources. We are switching over to Spark 3 and, while testing, found this troublesome handling of empty values.
First, I'll show the Spark 2 results. Query:
df = spark.sql(""" select id
      ,type
      ,value_1
      ,value_2
      ,case
        when value_1 in ('0') then True
        when value_2 = '0' and value_1 in ('0',' ', '') then True
        else False
       end as is_recent_value_zero
from db.table
""")
df.groupBy('is_recent_value_zero').count().show()
+--------------------+-------+
|is_recent_value_zero| count|
+--------------------+-------+
| true|1539323|
| false| 138414|
+--------------------+-------+
so all is well: Spark 2 identifies the empty values.
However, switching over to Spark 3, the same query returns:
+--------------------+-------+
|is_recent_value_zero| count|
+--------------------+-------+
| false|1677737|
+--------------------+-------+
so it is not picking up the empty values.
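For reference, here is the intent of the case expression restated in plain Python (illustrative only, not Spark API), to pin down which raw values should classify as True:

```python
# Plain-Python restatement of the case expression from the query above.
def is_recent_value_zero(value_1, value_2):
    if value_1 == "0":
        return True
    if value_2 == "0" and value_1 in ("0", " ", ""):
        return True
    return False

print(is_recent_value_zero("", "0"))   # → True: the branch Spark 3 no longer hits
print(is_recent_value_zero("", "2"))   # → False
```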
I constructed tests to see whether I could identify the value that is being pulled from the database:
from pyspark.sql import functions as f
from pyspark.sql.functions import col, when, lit
import numpy as np

tmp = df.groupBy('value_1', 'value_2').count()
tmp = tmp.withColumn('value_1_len', f.length(col('value_1')))
tmp = tmp.withColumn('test1', when(col('value_1').isNull(), lit("null"))
                     .when(col('value_1')=='', lit("no space"))
                     .when(col('value_1')==' ', lit("single space"))
                     .when(col('value_1')==None, lit("none"))
                     .when(col('value_1')==np.nan, lit("np nan"))
                     .when(col('value_1').isin('', ' '), lit("list 1"))
                     .when(col('value_1').isin('', ' ', "", " "), lit("list 2"))
                     .when((col('value_1').isin('', ' ')) | (col('value_2').isNull()), lit("list 3"))
                     .otherwise(lit("other")))\
         .withColumn('test2', when(col('value_2').isNull(), lit("null"))
                     .when(col('value_2')=='', lit("no space"))
                     .when(col('value_2')==' ', lit("single space"))
                     .when(col('value_2')==None, lit("none"))
                     .when(col('value_2')==np.nan, lit("np nan"))
                     .otherwise(lit("other")))
tmp.show()
returns
+-------+-------+-------+-----------+-----+-----+
|value_1|value_2| count|value_1_len|test1|test2|
+-------+-------+-------+-----------+-----+-----+
| | 0|1539323| 0|other|other|
| | | 127568| 0|other|other|
| | 2| 3994| 0|other|other|
| | 1| 4913| 0|other|other|
| | 3| 1939| 0|other|other|
+-------+-------+-------+-----------+-----+-----+
The "empty" value in the field 'value_1' is not being identified as anything I can think of. And the length of the field is zero!
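One further diagnostic that might narrow it down (a suggestion, not one of the runs above): collect a suspicious row locally and dump the exact Unicode code points of value_1, since an invisible character looks empty when printed but compares unequal to '' and ' '. In plain Python, where `suspect` stands in for the collected value:

```python
# Dump the code points of a suspicious value: invisible characters
# (non-breaking space, zero-width space, NUL, ...) look empty when
# printed but are not equal to '' or ' '. The value here is a stand-in.
suspect = "\u00a0"                      # e.g. a non-breaking space
print([hex(ord(c)) for c in suspect])  # → ['0xa0'], clearly not '' or ' '
```

(That said, a hidden character would normally show up in length(), which here reports 0, so this check mainly serves to rule the idea out.)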
After playing around with a couple of things, I found that wrapping a trim() function around value_1 and value_2 solves the issue. Using trim(value_1) and trim(value_2) in the case expression yields:
+--------------------+-------+
|is_recent_value_zero| count|
+--------------------+-------+
| true|1539323|
| false| 138414|
+--------------------+-------+
Does anyone know whether empty/null values are identified differently in Spark 3 vs Spark 2, or how we can handle this apart from the trim() function (we'd need to apply it every time we reference these fields)? Is it perhaps a setting on the server side, or some interpretation layer in PySpark (excuse the non-technical language)?
Any insights would be helpful. Thank you.