I have a table with 36 million rows, and I need to run different statistical analyses on it (e.g. hypothesis tests, distribution analysis, etc.).
Since I get a memory error when I use the export_to_pandas
method, I need to read the data either chunk-wise or, which is my preferred option, into a Dask dataframe.
However, after several tries I have not managed to import the table from the Exasol database into a Dask dataframe.
What could the code look like?
Even the chunk-wise code did not work:
import pyexasol

# Connect to Exasol
connection = pyexasol.connect(dsn="localhost:8563", user="your_user", password="your_password", schema="your_schema")

# Define your SQL query
sql_query = "SELECT * FROM your_table"

# Set the chunk size for fetching data
chunk_size = 1000

# Execute the query and fetch data in chunks
with connection.cursor() as cursor:
    cursor.execute(sql_query)

    while True:
        # Fetch data in chunks
        data_chunk = cursor.fetchall(chunk_size=chunk_size)

        # Break the loop if no more data is available
        if not data_chunk:
            break

        # Process the data (replace this with your actual data processing logic)
        for row in data_chunk:
            print(row)

# Close the connection
connection.close()
The EXPORT TO LOCAL FILE command is unlikely to fail with an out-of-memory error. Therefore, I presume the error is thrown during the conversion of the CSV file into a Pandas dataframe: https://github.com/exasol/pyexasol/blob/master/pyexasol/callback.py#L42
Therefore, one may try to tune it via Pandas-specific parameters (low_memory, chunksize, etc.), passing them via **kwargs to export_to_pandas, as in the first sketch below. One more option: try a parallel export and combine the resulting dataframes.
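A minimal sketch of the kwargs approach, based on the statement above that export_to_pandas forwards extra keyword arguments to pandas.read_csv; the connection details and the choice of low_memory=False are only placeholders:

import pyexasol

connection = pyexasol.connect(dsn="localhost:8563", user="your_user", password="your_password", schema="your_schema")

# Extra keyword arguments are forwarded to pandas.read_csv inside the
# export callback; low_memory=False is only an illustrative choice.
df = connection.export_to_pandas("SELECT * FROM your_table", low_memory=False)

connection.close()

For the Dask dataframe the question asks about, a simpler route than pyexasol's parallel HTTP export is to export the table to a local CSV file first and then let Dask read it in partitions. This is only a sketch: the file name, the with_column_names export parameter and the blocksize value are assumptions to adjust to your setup.

import pyexasol
import dask.dataframe as dd

connection = pyexasol.connect(dsn="localhost:8563", user="your_user", password="your_password", schema="your_schema")

# Stream the table into a local CSV file; this step does not build a
# pandas dataframe in memory.
connection.export_to_file("your_table.csv", "SELECT * FROM your_table",
                          export_params={"with_column_names": True})
connection.close()

# Read the CSV lazily in partitions; blocksize is an assumption to tune
# to the available memory.
ddf = dd.read_csv("your_table.csv", blocksize="64MB")

From there, computations such as ddf.describe().compute() can run on the partitions rather than on a single in-memory dataframe holding all 36 million rows.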