I have Spark code running on a Dataproc cluster that reads a table from BigQuery into a Spark DataFrame. At one step I need to do some data processing with pandas DataFrame logic, but when I try to convert the Spark DataFrame to a pandas DataFrame I hit an error I have not been able to resolve. It's worth noting that the same code runs fine on Hadoop without any issues.
I would appreciate any guidance on resolving this conversion error.
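For context, the read side looks roughly like this (simplified; the table reference is a placeholder, using the spark-bigquery connector that Dataproc ships with):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-to-pandas").getOrCreate()

# Read the source table through the spark-bigquery connector;
# the table reference below is a placeholder, not the real one.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .load()
)
```

The job then fails on the conversion to pandas: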
df = df.toPandas()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py", line 141, in toPandas
File "/opt/conda/default/lib/python3.8/site-packages/pandas/core/frame.py", line 2317, in from_records
mgr = arrays_to_mgr(arrays, columns, result_index, typ=manager)
File "/opt/conda/default/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 153, in arrays_to_mgr
return create_block_manager_from_column_arrays(
File "/opt/conda/default/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 2142, in create_block_manager_from_column_arrays
mgr._consolidate_inplace()
File "/opt/conda/default/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1829, in _consolidate_inplace
self.blocks = _consolidate(self.blocks)
File "/opt/conda/default/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 2272, in _consolidate
merged_blocks, _ = _merge_blocks(
File "/opt/conda/default/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 2297, in _merge_blocks
new_values = np.vstack([b.values for b in blocks]) # type: ignore[misc]
File "<__array_function__ internals>", line 180, in vstack
File "/opt/conda/default/lib/python3.8/site-packages/numpy/core/shape_base.py", line 282, in vstack
return _nx.concatenate(arrs, 0)
File "<__array_function__ internals>", line 180, in concatenate
numpy.core._exceptions.MemoryError: Unable to allocate 69.9 MiB for an array with shape (4289, 2137) and data type int64
The code processes a relatively small amount of data: at most around 10,000 rows. The main challenge is that the source table has over 4,000 columns. In this particular run, the conversion failed at 2,137 rows.
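The numbers in the error line up with that: a consolidated int64 block of shape (4289, 2137) needs 4289 × 2137 × 8 bytes ≈ 73.3 MB ≈ 69.9 MiB, i.e. pandas is trying to pack 4,289 int64 columns for 2,137 rows into one dense array on the driver.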
When converting a Spark DataFrame to pandas with `toPandas()`, the data is no longer distributed across the Dataproc nodes; it is all collected onto the driver machine, which is why you get the memory error. Instead of converting your data to pandas, try to use the Spark DataFrame API to process it.
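A minimal sketch of that approach, assuming a toy aggregation (the column names `key` and `value` are placeholders, not from the original code):

```python
from pyspark.sql import functions as F

# Express the pandas logic with the Spark DataFrame API instead,
# so the work stays distributed across the executors:
result = df.groupBy("key").agg(F.mean("value").alias("avg_value"))

# Only a small, already-aggregated result ever reaches the driver:
result.show()

# If the logic really needs pandas-style syntax and you are on
# Spark 3.2+, the pandas API on Spark keeps execution distributed:
#   import pyspark.pandas as ps
#   psdf = df.pandas_api()  # pandas-like interface backed by Spark
```

If you truly must collect, raising `spark.driver.memory` and enabling `spark.sql.execution.arrow.pyspark.enabled` can make `toPandas()` cheaper, but with 4,000+ columns the whole frame still has to fit in driver memory.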