Pyspark (Pandas on Spark) OOM Error with Series.apply()


I'm running some transformations on a relatively small dataset (2 GB). I'm using a Spark pool with 2 executors, each with 8 cores and 56 GB of memory, which should be overkill. The dataset is partitioned into 20 pieces, which I think is reasonable. However, I consistently get OOM errors (java.lang.OutOfMemoryError: unable to create new native thread) when applying functions to the DataFrame, or to a Series taken from it. For example:

from dateutil import parser as dparser

def get_date(x):
    # Fuzzy-parse the raw string and normalize it to YYYY-MM-DD; return None if it can't be parsed.
    try:
        std_date = dparser.parse(x, fuzzy=True)
        return std_date.strftime('%Y-%m-%d')
    except Exception:
        return None

df[std_date_colname] = df['og_date'].apply(get_date)

This consistently fails with the OutOfMemoryError above.

While playing around with different configurations, I've noticed that some runs end up calling DataFrame.to_pandas(), which understandably causes an OOM error, but I don't understand why to_pandas() is being called in the first place.
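
For reference, the pandas-on-Spark docs note that apply executes the function once (on some rows) to infer the return type when no type hint is given, which can be expensive. I don't know whether that is related to what I'm seeing, but this is the same function with an explicit return type hint, using the same column names as above:

def get_date(x) -> str:
    # Explicit return type hint so pandas-on-Spark does not have to run the
    # function on sample rows just to infer the output type.
    try:
        return dparser.parse(x, fuzzy=True).strftime('%Y-%m-%d')
    except Exception:
        return None

df[std_date_colname] = df['og_date'].apply(get_date)

As far as I can tell this runs the same transformation; the hint only removes the type-inference step.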

I have not been able to find any documentation explaining why DataFrame.apply() or Series.apply() would cause spilling like this. Because some runs appear to shuffle, I have tried increasing spark.executor.memoryOverheadFactor to leave more room for shuffle, but it has not helped.
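
In case the configuration matters, this is roughly how the overhead factor is being raised (the 0.2 value is just an example of what I tried; since it affects executor launch, it has to be set before the session starts rather than mid-job):

from pyspark.sql import SparkSession

# Example only: in a managed Spark pool this would normally go in the
# pool/session configuration rather than be set from the notebook.
spark = (
    SparkSession.builder
    .config("spark.executor.memoryOverheadFactor", "0.2")  # default is 0.1
    .getOrCreate()
)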

Any help/insight would be appreciated.

Thank you!


There are 0 answers