I have two dataframes which look like this:
df1(20M rows):
+-----------------+--------------------+--------+----------+
| id | geolocations| lat| long|
+-----------------+--------------------+--------+----------+
|4T1BF1FK1HU376566|kkxyDbypwQ????uGs...|30.60 | -98.39 |
|4T1BF1FK1HU376566|i~nyD~~xvQA??????...|30.55 | -98.27 |
|4T1BF1FK1HU376566|}etyDzqxvQb@Sy@zB...|30.58 | -98.27 |
|JTNB11HK6J3000405|kkxyDbypwQ????uGs...|30.60 | -98.39 |
|JTNB11HK6J3000405|i~nyD~~xvQA??????...|30.55 | -98.27 |
df2(50 rows):
+---------+-----------+--------------------+
| lat| long| state|
+---------+-----------+--------------------+
|63.588753|-154.493062| Alaska|
|32.318231| -86.902298| Alabama|
| 35.20105| -91.831833| Arkansas|
|34.048928|-111.093731| Arizona|
I want to add a new column 'state' to df1 by comparing the lat/long values in df1 and df2. An exact join on lat-long returns zero records, so I am using a threshold and joining on an approximate match:
from pyspark.sql import functions as F

threshold = F.lit(3)

def lat_long_approximation(col1, col2, threshold):
    return F.abs(col1 - col2) < threshold

df3 = df1.join(
    F.broadcast(df2),
    lat_long_approximation(df1.lat, df2.lat, threshold)
    & lat_long_approximation(df1.long, df2.long, threshold)
)
This is taking a long time. Can anyone help me optimise this join, or suggest a better approach that avoids the separate function (lat_long_approximation)?
You can use Column.between instead of the separate comparison function. I am not sure how it compares in performance.