Does Spark SQL optimize lower() on both sides?

Question

Does Spark SQL optimize lower() on both sides?

69 views Asked by Kathmandude At 13 December 2022 at 19:27

Say I have this pseudo code in Spark SQL where t1 is a temp view built off of partitioned parquet files in HDFS and t2 is a small lookup file to filter the said temp view

select t1.* 
from t1
where exists(select * 
             from t2 
             where t1.id=t2.id and 
                   lower(t1.col) like lower(t2.pattern))  --to mimic ilike functionality

Will the optimizer treat lower(t1.col) like lower(t2.pattern) as case insensitive match? Or will it run transformations on these columns before performing the match?

I don't have access to the DAG to see what exactly happens behind the scenes so I am asking here to see if this is a known/documented optimization trick.

Original Q&A

There are 1 answers

**Abdennacer Lachiheb** · Answer 1 · 2022-12-13T23:54:14+00:00

I tried to reproduce that case using scala and then I called explain() to get the physical plan (I'm pretty sure sql and scala will have the same physical plan because behind the scene it's the same optimizer named “Catalyst”)

import spark.implicits._
val df1 = spark.sparkContext.parallelize(Seq(("Car", "car"), ("bike", "Rocket"), ("Bus", "BUS"), ("Auto", "Machine") )).toDF("c1", "c2")
df1.filter(lower(col("c1")).equalTo(lower(col("c2")))).explain()

== Physical Plan ==
*(1) Project [_1#3 AS c1#8, _2#4 AS c2#9]
+- *(1) Filter ((isnotnull(_1#3) AND isnotnull(_2#4)) AND (lower(_1#3) = lower(_2#4)))
   +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false, true) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false, true) AS _2#4]
      +- Scan[obj#2]

As you can see in the logical plan it will call lower each time to compare the 2 values: lower(_1#3) = lower(_2#4).

Btw I tried same thing joining 2 dataframe, then filtering on lower but I got the same result.

I hope this answer your question.

TechQA.

Does Spark SQL optimize lower() on both sides?

There are 1 answers

Related Questions in APACHE-SPARK

Related Questions in APACHE-SPARK-SQL

Related Questions in CATALYST-OPTIMIZER

Popular Questions

Popular Tags

Trending Questions