I have a PySpark program with a function that accepts a string as a parameter. The strings themselves are contained in an array, so essentially I'm looping through the string array and calling the function from within the loop. While doing so, I'm appending the string output returned by the function to another string array.
Now, my function has a series of if-else statements in it which check the argument and choose the branch to execute. All these if blocks are independent pieces of code and only share a global cached DataFrame and a global Spark session.
I want the function calls to be made concurrently rather than in the FIFO manner that is happening now. Which is the better option in Python for this?
- Multithreading?
- Multiprocessing?
- AsyncIO?
It would be helpful if an example code could be provided!
My example pseudo code:

```python
# Both of these are assumed to be initialised elsewhere and are shared
# by all the branches below.
global spark_session
global cached_dataframe

x = ["A", "B", "C"]
y = []

def test(z):
    if z == "A":
        # ------- SOME CODE FOR "A" -------
        pass
    elif z == "B":
        # ------- SOME CODE FOR "B" -------
        pass
    elif z == "C":
        # ------- SOME CODE FOR "C" -------
        pass

for i in x:
    y.append(test(i))
```
If concurrency is not possible here, can you suggest a better way to organise my code, like avoiding these if-else blocks and so on? Because with my current requirement, this if-else block is going to grow endlessly!
The first thought would be to change the scheduling configuration from 'FIFO' to 'FAIR' (assuming `spark` is your SparkSession object). More about configuring the scheduler is here: http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
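A minimal sketch of that configuration, assuming you create the session yourself (the app name is arbitrary); note that `spark.scheduler.mode` must be set before the underlying SparkContext starts:

```python
from pyspark.sql import SparkSession

# spark.scheduler.mode has to be set when the context is created; it
# cannot be changed on an already-running session.
spark = (
    SparkSession.builder
    .appName("concurrent-branches")  # hypothetical app name
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()
)
```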
I don't think multiprocessing would make sense here, as this is more about the scheduling of Spark tasks (the compute-heavy work is presumably being done by Spark, not by Python). The other idea would be to use a queue with multiple threads:
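A sketch of the queue-based approach, assuming `test` is the function from the question (the worker count of 3 is arbitrary):

```python
import threading
import queue

work = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # Pull arguments off the queue until the sentinel arrives.
    while True:
        z = work.get()
        if z is None:  # sentinel value: no more work
            break
        out = test(z)  # `test` is the function from the question
        with results_lock:
            results.append(out)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for item in ["A", "B", "C"]:
    work.put(item)
for _ in threads:
    work.put(None)  # one sentinel per worker thread

for t in threads:
    t.join()
```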
You could also probably do something with the `ThreadPoolExecutor` in the `concurrent.futures` module (https://docs.python.org/3/library/concurrent.futures.html):
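A sketch, again assuming `test` from the question; `executor.map` yields the results in the same order as the inputs:

```python
from concurrent.futures import ThreadPoolExecutor

x = ["A", "B", "C"]

# map() runs test() across the inputs on worker threads and returns
# results in input order.
with ThreadPoolExecutor(max_workers=len(x)) as executor:
    y = list(executor.map(test, x))
```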
Or even submit each call individually:
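A sketch of that variant (still assuming `test` from the question); `as_completed` yields each future as it finishes, so results can arrive out of input order:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

x = ["A", "B", "C"]
y = []

with ThreadPoolExecutor(max_workers=len(x)) as executor:
    # submit() returns a Future per call; as_completed() yields each
    # future as soon as it finishes.
    futures = [executor.submit(test, z) for z in x]
    for future in as_completed(futures):
        y.append(future.result())
```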
This lets you take advantage of the `Future` objects which the executor returns. Since I'm not intimately familiar with the requirements of your application, I'm not sure how much that would benefit you, but I hope I've laid out a few potentially useful approaches using multithreading! I've personally used both approaches with Spark applications and seen performance improvements.