I am running an application on PySpark. Below is a snapshot of the executor distribution for this application. The work looks non-uniformly distributed. Can someone have a look and tell me where the problem is?
Description and My Problem:-
I am running my application on huge data, in which I am filtering and joining 3 datasets. After that, I am caching the joined dataset to generate and aggregate features for different time periods (i.e., the cached dataset generates features in a loop). After this, I try to store these features in a Parquet file. Writing this Parquet file is taking too much time.
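Roughly, the structure of my code looks like the sketch below (the paths, dataset names, columns, and time periods are simplified placeholders, not my actual schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-generation").getOrCreate()

# Read and filter the three datasets (paths and filter columns are placeholders)
df1 = spark.read.parquet("/data/ds1").filter(F.col("event_date") >= "2017-01-01")
df2 = spark.read.parquet("/data/ds2")
df3 = spark.read.parquet("/data/ds3")

# Join the three datasets and cache the result, since it is reused in the loop
joined = df1.join(df2, "id").join(df3, "id").cache()

# Generate and aggregate features for each time period
features = []
for period in ["2017-01", "2017-02", "2017-03"]:
    agg = (joined
           .filter(F.col("month") == period)
           .groupBy("id")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("id").alias("num_events")))
    features.append(agg)

# Union the per-period features and write them out -- this write is the slow step
result = features[0]
for f in features[1:]:
    result = result.union(f)

result.write.mode("overwrite").parquet("/output/features")
```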
Can anyone help me solve this? Let me know if you need further information.
My initial suggestion would be to use as few shuffle operations, such as joins, as possible. However, if you wish to persist with this approach, some suggestions I can provide are to tune your SparkContext in the following ways: