pyspark java heap space error when just retrieving data frame columns

Question

pyspark java heap space error when just retrieving data frame columns

132 views Asked by zhh210 At 12 February 2021 at 07:15

Is spark's lazy evaluation really executing anything for the following simple example of pointing to a partition of a hive table and getting columns but nothing really heavy:

>>> spark.sql('select * from default.test_table where day="2021-01-01"').columns
[Stage 0:===============================>                   (1547 + 164) / 2477]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 28049"...
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o61.sql

I don't see why just pointing to a hive table takes much memory from PySpark (Version 2.4.3). Adding memory to driver and executor (driver-memory, executor-memory) only makes the query stuck forever without outputting any useful message. Is there a way to suppress PySpark from executing when just defining a data frame?

Original Q&A

There are 1 answers

**Shadowtrooper** · Answer 1 · 2021-02-12T11:46:03+00:00

Shadowtrooper On 12 February 2021 at 11:46

You can put a limit on the query to avoid memory errors:

spark.sql('select * from default.test_table where day="2021-01-01" limit 1').columns

TechQA.

pyspark java heap space error when just retrieving data frame columns

There are 1 answers

Related Questions in PYSPARK

Related Questions in JAVA-HEAP

Popular Questions

Popular Tags

Trending Questions