How to compare all column values in a CSV file with multiple other columns' values using only a single join in Spark Java


I have a dataframe Dataset<Row> fileDf loaded from a CSV file that looks like this:

|COUNTRY_ID|COUNTRY_NAME|REGION_ID|
|----------|------------|---------|
|AR        |Argentina   |        2|
|AU        |Australia   |        3|
|BE        |Belgium     |        1|
|BR        |Brazil      |        2|

I will also have multiple single-column dataframes (each corresponding to one column's values) loaded from other CSV files, e.g. Dataset<Row> singleColumnDf1:

|NAMES                   |
|------------------------|
|Argentina               |
|Australia               |
|Belgium                 |
|Brazil                  |

Dataset<Row> singleColumnDf2

|IDs      |
|---------|
|        2|
|        3|
|        1|
|        2|

and so on. I want to take singleColumnDf1, join it with fileDf, find out which column of fileDf contains the values from singleColumnDf1, and display that column's name. Similarly, I want to do the same with singleColumnDf2 and so on. Since fileDf stays constant, I want to perform the join only once to avoid the overhead, because my main focus is performance. The result should tell me which column of fileDf matches each singleColumnDf.

For example, I want the final result to be:

*for singleColumnDf1 the matched column is COUNTRY_NAME,*

*for singleColumnDf2 the matched column is REGION_ID*

Is this possible in Spark? Can I combine the singleColumnDfs together, join with fileDf only once, and use keys to differentiate the results? Or do I have to perform a separate join for each singleColumnDf?
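One approach I have been considering (a rough sketch, not tested at scale; the class and method names are my own, not from any library): unpivot fileDf into (columnName, value) pairs, tag each single-column dataframe with a key and union them into (dfName, value) pairs, then join once on value. A column counts as matched when every distinct value of the single-column dataframe appears in it.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.Map;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.lit;

public class ColumnMatcher {

    // Returns one row per (dfName, fileCol) pair where every distinct value
    // of that single-column dataframe appears in that column of fileDf.
    public static Dataset<Row> matchColumns(Dataset<Row> fileDf,
                                            Map<String, Dataset<Row>> singleColumnDfs) {
        // 1. Unpivot fileDf into (fileCol, value) pairs; cast everything to
        //    string so values from differently typed columns are comparable.
        Dataset<Row> melted = null;
        for (String c : fileDf.columns()) {
            Dataset<Row> part = fileDf.select(
                    lit(c).as("fileCol"),
                    col(c).cast("string").as("value"));
            melted = (melted == null) ? part : melted.unionByName(part);
        }

        // 2. Tag each single-column dataframe with its key and union them,
        //    so a single join covers every input at once.
        Dataset<Row> tagged = null;
        for (Map.Entry<String, Dataset<Row>> e : singleColumnDfs.entrySet()) {
            String valueCol = e.getValue().columns()[0];
            Dataset<Row> part = e.getValue().select(
                    lit(e.getKey()).as("dfName"),
                    col(valueCol).cast("string").as("value"));
            tagged = (tagged == null) ? part : tagged.unionByName(part);
        }

        // 3. The single join on value; dfName says which input each match
        //    came from, fileCol says which fileDf column it landed in.
        Dataset<Row> matched = tagged.distinct()
                .join(melted.distinct(), "value")
                .groupBy("dfName", "fileCol")
                .agg(countDistinct("value").as("matchedValues"));

        Dataset<Row> totals = tagged.distinct()
                .groupBy("dfName")
                .agg(countDistinct("value").as("totalValues"));

        // 4. Keep only the columns where every distinct value was found.
        return matched.join(totals, "dfName")
                .where(col("matchedValues").equalTo(col("totalValues")))
                .select("dfName", "fileCol");
    }
}
```

With the sample data above this would report (singleColumnDf1, COUNTRY_NAME) and (singleColumnDf2, REGION_ID). Note the union makes one tall dataframe, so Spark still scans all inputs; the saving is that fileDf is unpivoted and shuffled for the join only once instead of once per singleColumnDf.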
