I have a series of ~30 datasets that all need to be joined together to make a wide final table. This final table takes ~5 years of individual tables (one table per year) and unions them together, then joins this full history with the full history of other tables (similarly unioned) to make a big, historical, wide table.
The layout of these per-year tables looks like this:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
The tables for the other years look like this:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 1 |
| key_2 | 1 |
These are then unioned together to create:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
Similarly, unioning a second type of table results in the following:
table_type_2:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
I now want to join table_type_1 with table_type_2 on primary_key and year to yield a much wider table. I notice that this final join takes a very long time and shuffles a lot of data.
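Roughly, the job looks like the sketch below (simplified, assuming the per-year tables are stored as Parquet under illustrative paths):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("wide-history").getOrCreate()

// Read one table per year and union them into a full history
// (paths and the year range are illustrative; the real job covers ~5 years per type).
def unionYears(basePath: String): DataFrame =
  (0 to 4)
    .map(y => spark.read.parquet(s"$basePath/year=$y"))
    .reduce(_ unionByName _)

val tableType1 = unionYears("/data/table_type_1")
val tableType2 = unionYears("/data/table_type_2")

// The slow, shuffle-heavy step: joining the two histories on the composite key.
val wide = tableType1.join(tableType2, Seq("primary_key", "year"))
```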
How can I make this faster?
My advice is to do the union on the small datasets first and then broadcast the result of that union; Spark will ship the broadcast dataset to each of its nodes, which reduces the number of shuffles. Unions in Spark are well optimized, so what you really have to think about is the rest of the process: select only the columns you need from the beginning, and avoid costly operations such as groupByKey before the union, because Spark will execute them when it computes the final result. I would also avoid Hive, because it uses the MapReduce execution model, which does not compare well with Spark SQL. You can use a function along the following lines (just change the key); use Scala if you can, since it interacts directly with Spark:
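A minimal sketch of such a broadcast-join helper, assuming the smaller unioned table fits comfortably in executor memory (the function name and default key columns are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Join a large history with a smaller one on a composite key, hinting Spark
// to broadcast the small side so the large side is not shuffled for the join.
def joinWithBroadcast(large: DataFrame,
                      small: DataFrame,
                      keys: Seq[String] = Seq("primary_key", "year")): DataFrame =
  large.join(broadcast(small), keys)

// Usage, with tableType1 / tableType2 being the unioned histories from the question:
// val wide = joinWithBroadcast(tableType1, tableType2)
```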