PySpark stuck when converting RDD to DataFrame


PySpark freezes when I run very simple code (Spark 3.5.0, Apple Silicon M1 Pro). The same code used to work fine with Spark 3.4.2, so I suspected a version compatibility issue and tried many different combinations (Python 3.9 + Spark 3.4.2, Python 3.10 + Spark 3.4.2, Python 3.9 + Spark 3.5.0, etc.), but the hang persists.
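Roughly, the code is as simple as this (a minimal sketch; the names and data here are illustrative, not my exact snippet):

```python
from pyspark.sql import SparkSession

# Plain local session, default config.
spark = SparkSession.builder.master("local[*]").appName("repro").getOrCreate()

# Tiny RDD -> DataFrame; schema inference and show() each launch a job
# that needs a Python worker, and that is where everything freezes.
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])
df = rdd.toDF(["id", "value"])
df.show()  # never returns; stage 0 sits at 0/1 tasks
```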


I checked the executor in the Spark UI and then ran jstack against it. The dump shows something blocked between Python and Java. Incidentally, the same code works fine in both spark-shell and SparkR.
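For reference, this is roughly how I captured the dump below (a sketch; jps and jstack ship with the JDK, and the PID is a placeholder to replace with the hung JVM's process id):

```python
import subprocess

# List running JVMs with jps to find the driver/executor process
# of the hung job.
print(subprocess.run(["jps", "-lm"], capture_output=True, text=True).stdout)

# Hypothetical PID: replace with the process id printed above.
pid = "12345"
print(subprocess.run(["jstack", pid], capture_output=True, text=True).stdout)
```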

"Idle Worker Monitor for python3" #64 daemon prio=5 os_prio=31 tid=0x0000000125051800 nid=0xf303 waiting for monitor entry [0x00000001769d2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.spark.api.python.PythonWorkerFactory$MonitorThread.run(PythonWorkerFactory.scala:328)
    - waiting to lock <0x00000007aea31be0> (a org.apache.spark.api.python.PythonWorkerFactory)

"Executor task launch worker for task 0.0 in stage 0.0 (TID 0)" #63 daemon prio=5 os_prio=31 tid=0x0000000123c98800 nid=0x5107 runnable [0x00000001767c5000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:224)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:127)
    at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:143)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:142)
    - locked <0x00000007aea31be0> (a org.apache.spark.api.python.PythonWorkerFactory)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
    - locked <0x00000007808be028> (a org.apache.spark.SparkEnv)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.executor.Executor$TaskRunner$$Lambda$1352/1057379762.apply(Unknown Source)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
