pyspark java heap space error when just retrieving data frame columns

144 views Asked by At

Is spark's lazy evaluation really executing anything for the following simple example of pointing to a partition of a hive table and getting columns but nothing really heavy:

>>> spark.sql('select * from default.test_table where day="2021-01-01"').columns
[Stage 0:===============================>                   (1547 + 164) / 2477]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 28049"...
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o61.sql

I don't see why just pointing to a hive table takes much memory from PySpark (Version 2.4.3). Adding memory to driver and executor (driver-memory, executor-memory) only makes the query stuck forever without outputting any useful message. Is there a way to suppress PySpark from executing when just defining a data frame?

1

There are 1 answers

1
Shadowtrooper On

You can put a limit on the query to avoid memory errors:

spark.sql('select * from default.test_table where day="2021-01-01" limit 1').columns