Assume I have an RDD of Integers from 1 to 1,000,000,000 and I want to print them in order using foreachPartition. It can happen that the partition containing 5-6-7-8 is printed before the one containing 1-2-3-4. How can I prevent this?
Thanks, Maya
I think the only way to do this is to ensure there is only one partition, and then print your data. You can call repartition(1) or coalesce(1) on your RDD to reduce the number of partitions. For your use case I think coalesce is better, as it avoids a shuffle.
https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations
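Here is a minimal PySpark sketch of the coalesce(1) suggestion, in case it helps. It is only an illustration under some assumptions: the names format_partition, print_partition and main are made up for this example, the range 1 to 8 stands in for the real 1 to 1,000,000,000, and the master is local, so the printed lines show up on the console (on a cluster they would go to the executor's stdout instead). It also assumes the data is already in ascending order across partitions, as it is when the RDD is built from a range.

```python
def format_partition(records):
    """Turn the records of one partition into lines, keeping iteration order."""
    return [str(record) for record in records]


def print_partition(records):
    """Print every record of one partition, in the order the iterator yields them."""
    for line in format_partition(records):
        print(line)


def main():
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[4]", "ordered-print-sketch")
    try:
        # Two partitions: [1, 2, 3, 4] and [5, 6, 7, 8] (small stand-in data).
        rdd = sc.parallelize(range(1, 9), 2)

        # With two partitions, the two tasks may finish in either order:
        # rdd.foreachPartition(print_partition)

        # coalesce(1) merges the partitions without a shuffle, so a single
        # task walks through all of the records one after another.
        rdd.coalesce(1).foreachPartition(print_partition)
    finally:
        sc.stop()

# Call main() to run this against a local master.
```

Keep in mind that with a single partition one task has to process all of the records, so for a billion elements the printing will not be parallel at all.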