Non-uniform distribution of tasks and data on PySpark executors


I am running an application on PySpark. Below is a snapshot of the executor distribution for this application. The tasks and data look non-uniformly distributed across executors. Can someone have a look and tell me where the problem is?

[screenshot: executor summary showing the uneven task/data distribution]

Description and my problem:

I am running my application on a huge amount of data, filtering and joining 3 datasets. After that, I cache the joined dataset and use it to generate and aggregate features for different time periods (i.e. the cached dataset is reused to generate features in a loop). Finally, I try to store these features in a Parquet file, and this Parquet write is taking too much time. A rough sketch of the pipeline is below.
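A minimal sketch of what I am doing; the table paths, column names, and time windows are placeholders, not my real ones:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-generation").getOrCreate()

# Hypothetical inputs; filter and join 3 datasets.
a = spark.read.parquet("/data/a").filter(F.col("status") == "active")
b = spark.read.parquet("/data/b")
c = spark.read.parquet("/data/c")

joined = a.join(b, "id").join(c, "id").cache()
joined.count()  # materialize the cache once before the loop

# Generate features for different time periods in a loop.
feature_frames = []
for days in [7, 30, 90]:  # example time windows
    windowed = joined.filter(
        F.col("event_date") >= F.date_sub(F.current_date(), days)
    )
    feats = windowed.groupBy("id").agg(F.count("*").alias(f"events_last_{days}d"))
    feature_frames.append(feats)

# Combine the per-window features and write them out; this write is the slow part.
features = feature_frames[0]
for f_df in feature_frames[1:]:
    features = features.join(f_df, "id", "outer")

features.write.mode("overwrite").parquet("/data/features")
```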

Can anyone help me solve this? Let me know if you need further information.

There are 2 answers

Answer by Jack_The_Ripper:

My initial suggestion would be to use as few shuffle operations (such as joins) as possible. However, if you wish to persist with this approach, you can tune your Spark configuration in the following ways (a sketch of the corresponding settings follows the list):

  • Use Kryo Serializer
  • Compress data before sending over the network
  • Play around with your JVM garbage collection
  • Increase your shuffle memory
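One way (not the only way) to apply these suggestions when building the session; the specific values are assumptions to be tuned against your own workload:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-app")
    # Kryo is usually faster and more compact than Java serialization
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Compress shuffle data before it goes over the network
    .config("spark.shuffle.compress", "true")
    .config("spark.shuffle.spill.compress", "true")
    # G1GC often behaves better than the default collector for large heaps
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # Give execution/shuffle a larger share of executor memory
    # (on Spark 1.6+ this is controlled by spark.memory.fraction)
    .config("spark.memory.fraction", "0.7")
    .getOrCreate()
)
```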
Answer by Alper t. Turker:

As you stated (emphasis mine):

I am filtering and joining 3 datasets. After that, I am caching joined data set for generating and aggregating features

Both joins and, to a lesser extent, aggregations can result in a skewed distribution of data if the join keys or grouping columns are not uniformly distributed. This is a natural consequence of the required shuffles.

In the general case there is very little you can do about it. In specific cases it is possible to gain a little with broadcasting or salting (sketched below), but the skew doesn't look particularly severe in your case.
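A hedged sketch of both mitigations; the table paths and the "key" column are assumptions, and the salt factor N is something you would tune:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
large = spark.read.parquet("/data/large")   # skewed on "key"
small = spark.read.parquet("/data/small")

# 1) Broadcast join: skips shuffling the large side entirely,
#    provided the small side fits in executor memory.
broadcast_joined = large.join(F.broadcast(small), "key")

# 2) Salting: spread a hot key over N buckets so no single task receives it all.
N = 10
salted_large = large.withColumn("salt", (F.rand() * N).cast("int"))
salted_small = small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(N)]))
)
salted_joined = salted_large.join(salted_small, ["key", "salt"]).drop("salt")
```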