I'm writing a loop in PySpark and I get this error message:
"Column is not iterable"
This is the code:
(regexp_replace(data_join_result[varibale_choisie],
(random.choice(data_join_result.collect()[j][varibale_choisie])),
data_join_result.collect()[j][lettre_choisie] ))))
According to the error message, the problem occurs here:
data_join_result.collect()[j][lettre_choisie]
My input:
VARIABLEA | VARIABLEB
BLUE | WHITE
PINK | DARK
My expected output:
VARIABLEA | VARIABLEB
BLTE | WHITE
PINK | DARM
Does anyone know how to fix this? Thanks!
Collecting the data on the driver is not advisable, and neither is iterating through the DataFrame row by row. Spark offers several APIs that let us perform such tasks in parallel. In your case, you can try these approaches:
For a single-character replacement, try this (performance-intensive) option:
results:
You can see that oranges is corrupted to orangos. The chance of corruption increases if you limit the letters to replace to just vowels.
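As a minimal sketch of the single-character idea, assuming a Python UDF is acceptable (which is where the performance cost comes from): the helper names `replace_one_char` and `corrupt_column` are hypothetical, and the choice of uppercase letters as replacements is an assumption based on the sample data in the question.

```python
import random
import string


def replace_one_char(value):
    """Overwrite one random position with a random uppercase letter.

    The new letter may happen to equal the old one, so the string
    can occasionally come back unchanged.
    """
    if not value:  # leave None and empty strings untouched
        return value
    pos = random.randrange(len(value))
    letter = random.choice(string.ascii_uppercase)
    return value[:pos] + letter + value[pos + 1:]


def corrupt_column(df, column_name):
    """Apply replace_one_char to one string column of a Spark DataFrame."""
    # Imported here so the pure-Python helper stays usable without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # The function is random, so mark the UDF as non-deterministic.
    corrupt = F.udf(replace_one_char, StringType()).asNondeterministic()
    return df.withColumn(column_name, corrupt(F.col(column_name)))
```

With the DataFrame from the question this would be called as `corrupt_column(data_join_result, "VARIABLEA")`. Nothing is collected to the driver; the replacement runs per row on the executors.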
If you don't need a single-character replacement, try this:
Here you get some control over the result through the number of loop iterations.
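A rough sketch of this loop-based variant, assuming that replacing every occurrence of a randomly chosen letter is acceptable: the helper names `random_letter_pairs` and `replace_letters` are hypothetical, and the default alphabet is an assumption. The letters are picked on the driver as plain Python strings, and only the replacement itself runs in Spark.

```python
import random
import string


def random_letter_pairs(iterations, alphabet=string.ascii_uppercase):
    """Pick one (source, target) letter pair per loop iteration."""
    return [
        (random.choice(alphabet), random.choice(alphabet))
        for _ in range(iterations)
    ]


def replace_letters(df, column_name, pairs):
    """Replace every occurrence of each source letter with its target."""
    from pyspark.sql import functions as F

    for source, target in pairs:
        # Pattern and replacement are plain strings, not Columns.
        df = df.withColumn(
            column_name,
            F.regexp_replace(F.col(column_name), source, target),
        )
    return df
```

For example, `replace_letters(data_join_result, "VARIABLEB", random_letter_pairs(2))` applies two random letter substitutions to the whole column. More iterations mean more values are likely to change, and passing a smaller alphabet such as `"AEIOU"` restricts which letters are chosen.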
results: