How to use a large table from an Exasol database for statistical analysis in Python?


I have a table with 36 million rows and need to run different statistical analyses on it (e.g. hypothesis tests, distribution analysis, etc.).

Since I get a memory error when I use the export_to_pandas method, I need to read the data either chunkwise or, which is my preferred option, into a Dask dataframe.

However, after several tries I did not manage to import the table from the Exasol database into a Dask dataframe.

What could the code look like?
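For the Dask route, what I have in mind is to page through the table with LIMIT/OFFSET and wrap each page in dask.delayed, roughly like the sketch below (the connection details are placeholders, and the unique ordering column id is an assumption about the schema):

import dask
import dask.dataframe as dd
import pyexasol

DSN = "localhost:8563"
AUTH = dict(user="your_user", password="your_password", schema="your_schema")
TABLE = "your_table"
CHUNK_ROWS = 1_000_000

def fetch_chunk(offset, limit):
    # Each Dask task opens its own connection, so partitions can be
    # materialised independently (and in parallel)
    conn = pyexasol.connect(dsn=DSN, **AUTH)
    try:
        # ORDER BY a unique key ("id" is an assumed column) so that
        # LIMIT/OFFSET paging yields stable, non-overlapping slices
        return conn.export_to_pandas(
            f"SELECT * FROM {TABLE} ORDER BY id LIMIT {limit} OFFSET {offset}"
        )
    finally:
        conn.close()

# The total row count determines how many partitions are needed
conn = pyexasol.connect(dsn=DSN, **AUTH)
n_rows = conn.execute(f"SELECT COUNT(*) FROM {TABLE}").fetchval()
conn.close()

parts = [
    dask.delayed(fetch_chunk)(offset, CHUNK_ROWS)
    for offset in range(0, n_rows, CHUNK_ROWS)
]
ddf = dd.from_delayed(parts)  # meta is inferred from the first partition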

Even the chunkwise code did not work:

import pyexasol

# Connect to Exasol
connection = pyexasol.connect(dsn="localhost:8563", user="your_user", password="your_password", schema="your_schema")

# Define your SQL query
sql_query = "SELECT * FROM your_table"

# Set the chunk size for fetching data
chunk_size = 1000

# Execute the query; pyexasol has no DB-API cursor, execute() returns
# an ExaStatement that supports chunked fetching via fetchmany()
statement = connection.execute(sql_query)

while True:
    # Fetch the next chunk of rows
    data_chunk = statement.fetchmany(chunk_size)
    
    # Stop once no more rows are available
    if not data_chunk:
        break
    
    # Process the chunk (replace this with your actual data processing logic)
    for row in data_chunk:
        print(row)

# Close the connection
connection.close()
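For context, the per-chunk processing step would compute streaming statistics rather than print rows, e.g. a running mean and variance via Welford's online algorithm (a sketch; the first selected column is assumed to be numeric):

class RunningStats:
    # Welford's online algorithm: numerically stable streaming mean/variance
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

stats = RunningStats()

# Inside the fetch loop, instead of print(row) (column 0 assumed numeric):
#     for row in data_chunk:
#         stats.update(float(row[0]))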

1 Answer

Alexander Lipatiev

The EXPORT to a local file itself will hardly fail with an out-of-memory error. I therefore presume the error is thrown during the conversion of the CSV stream to a Pandas dataframe: https://github.com/exasol/pyexasol/blob/master/pyexasol/callback.py#L42

One may therefore try to tune that step via Pandas-specific parameters (low_memory, chunksize, etc.), passing them through **kwargs to export_to_pandas, like:

C.export_to_pandas('users', low_memory=False)
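In the same spirit, any pandas.read_csv keyword can be forwarded this way; for example, specifying narrower dtypes often shrinks the dataframe substantially (the column names and dtypes below are hypothetical; note that Exasol reports unquoted identifiers in uppercase):

import pyexasol

C = pyexasol.connect(dsn="localhost:8563", user="your_user",
                     password="your_password", schema="your_schema")

# **kwargs pass straight through to pandas.read_csv in pyexasol's callback;
# "USER_ID" and "SCORE" are hypothetical column names
df = C.export_to_pandas(
    'users',
    low_memory=False,
    dtype={"USER_ID": "int32", "SCORE": "float32"},
)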

One more option: try a parallel export and combine the resulting dataframes.
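pyexasol's actual parallel export runs several HTTP-transport workers in separate processes (see the examples in the pyexasol repository). As a simpler, sequential illustration of the "export slices and combine" idea, one could split on a numeric key ("id" is hypothetical here) and concatenate:

import pandas as pd
import pyexasol

C = pyexasol.connect(dsn="localhost:8563", user="your_user",
                     password="your_password", schema="your_schema")

# Split the table on a numeric key ("id" is hypothetical) so each
# export_to_pandas call only has to parse one slice of the CSV stream
lo, hi = C.execute("SELECT MIN(id), MAX(id) FROM users").fetchone()
step = 1_000_000

frames = [
    C.export_to_pandas(
        f"SELECT * FROM users WHERE id >= {start} AND id < {start + step}"
    )
    for start in range(int(lo), int(hi) + 1, step)
]

df = pd.concat(frames, ignore_index=True)
C.close()

The final concat still needs enough RAM for the combined frame, but each read_csv call only has to parse one slice, which is where the memory spike originated.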