How to improve performance in PySpark joins


I have two dataframes which look like this:

df1 (20M rows):

+-----------------+--------------------+--------+----------+                    
|              id |        geolocations|     lat|      long|
+-----------------+--------------------+--------+----------+
|4T1BF1FK1HU376566|kkxyDbypwQ????uGs...|30.60   | -98.39   |
|4T1BF1FK1HU376566|i~nyD~~xvQA??????...|30.55   | -98.27   |
|4T1BF1FK1HU376566|}etyDzqxvQb@Sy@zB...|30.58   | -98.27   |
|JTNB11HK6J3000405|kkxyDbypwQ????uGs...|30.60   | -98.39   |
|JTNB11HK6J3000405|i~nyD~~xvQA??????...|30.55   | -98.27   |

df2 (50 rows):

+---------+-----------+--------------------+
|      lat|       long|               state|
+---------+-----------+--------------------+
|63.588753|-154.493062|              Alaska|
|32.318231| -86.902298|             Alabama|
| 35.20105| -91.831833|            Arkansas|
|34.048928|-111.093731|             Arizona|

I want to get a new column 'state' in df1 by comparing the lat-long values in df1 and df2. As the sample data shows, a join on exact lat-long would return zero records, so I am joining with a threshold instead:

from pyspark.sql import functions as F

threshold = F.lit(3)

def lat_long_approximation(col1, col2, threshold):
    # True when the two values differ by less than the threshold
    return F.abs(col1 - col2) < threshold

df3 = df1.join(F.broadcast(df2),  # broadcast the tiny (50-row) df2
               lat_long_approximation(df1.lat, df2.lat, threshold) &
               lat_long_approximation(df1.long, df2.long, threshold))

This is taking a long time. Can anyone help me optimise this join, or suggest a better approach that avoids the separate function (lat_long_approximation)?


1 Answer

Answered by Lamanus (accepted answer):

You can use between, though I am not sure about the performance.

from pyspark.sql import functions as F

threshold = 10  # for test

df1.join(F.broadcast(df2),
         df1.lat.between(df2.lat - threshold, df2.lat + threshold) &
         df1.long.between(df2.long - threshold, df2.long + threshold),
         "left").show()
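As a follow-up: the condition here is a non-equi predicate, so Spark executes it as a broadcast nested-loop join (every df1 row checked against all 50 df2 rows), and a single df1 row can match several states. A common refinement, sketched below on the assumption that you want the single closest state per row, is to rank the broadcast matches by squared coordinate distance and keep only the nearest one. The row_id, dist_sq, and rn columns are my own illustrative names, not part of the original question or answer.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

threshold = 10  # same test value as in the answer

# Tag each df1 row with a unique id so we can pick one winner per row later
a = df1.withColumn("row_id", F.monotonically_increasing_id()).alias("a")
b = df2.alias("b")

# Broadcast threshold join, same shape as the accepted answer
candidates = a.join(
    F.broadcast(b),
    F.col("a.lat").between(F.col("b.lat") - threshold, F.col("b.lat") + threshold) &
    F.col("a.long").between(F.col("b.long") - threshold, F.col("b.long") + threshold),
    "left",
)

# Rank candidate states by squared coordinate distance, keep the closest
dist_sq = (F.col("a.lat") - F.col("b.lat")) ** 2 + (F.col("a.long") - F.col("b.long")) ** 2
w = Window.partitionBy("row_id").orderBy("dist_sq")

closest = (candidates
           .withColumn("dist_sq", dist_sq)
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("dist_sq", "rn", "row_id"))

The join itself still costs roughly 20M x 50 predicate evaluations, but the broadcast avoids shuffling df1 for the join; only the window step adds a shuffle, partitioned on row_id.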