I am building a Streamlit application in which ydata-profiling is executed when a button is pressed. The profiling runs on a separate server, like a microservice. The goal is to run the process asynchronously so that when multiple users click the button at the same time, they don't have to wait for each other's tasks to complete. The asynchronous part works, but it sometimes returns ValueError('Argument must be an image, collection, or ContourSet in this Axes'). From my research, this appears to be an issue with using matplotlib from multiple threads. https://github.com/matplotlib/matplotlib/issues/4823 (similar problem)
Is there a way to make this work with ydata-profiling?

MRE (run this with python -m uvicorn minimal_server:app --reload --port 8383):

from fastapi import FastAPI, HTTPException
import pandas as pd
from ydata_profiling import ProfileReport
import json
import os
from pydantic import BaseModel
import logging
import logging.config
import asyncio

logger = logging.getLogger(__name__)

app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Hello World"}

class Data(BaseModel):
    filename: str
    username: str

async def async_read_csv(file_path):
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, pd.read_csv, file_path)

async def async_to_file(profile, profile_path):
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, profile.to_file, profile_path)

@app.post("/profile/")
async def profile_file(data: Data):
    logger.info("Data: %s", data)

    try:
        file_dir = "datasets"
        file_path = os.path.join(file_dir, data.username, data.filename)
        logger.info(f'FILEPATH: {file_path}')
        
        if not os.path.exists(file_path):
            raise HTTPException(status_code=404, detail="File not found")

        df = await async_read_csv(file_path)
        print("File read")

        # profile_path = await run_profile(df, data.username, data.filename)

        profile = ProfileReport(df, pool_size=2)
        
        if not os.path.exists(os.path.join("reports",data.username)):
            os.makedirs(os.path.join("reports",data.username), exist_ok=True)
        profile_path = os.path.join("reports", data.username, data.filename.split('.')[0]) + ".html"
        print("Starting to process...")
        a = await async_to_file(profile, profile_path)
        print("End processing...")

        return {"profile_path": profile_path}

    except Exception as e:
        logger.error("Exception: %s", e)
        return {"error": str(e)}

The error can be reproduced by running the Python file below in multiple terminals at once, or from multiple processes:

import requests
import uuid 
import pandas as pd
import os

def main():
    url = f"http://127.0.0.1:8383/profile/"
    filename = f'{uuid.uuid4()}.csv'
    username = "john"

    print("Creating dataframe...")
    df = pd.DataFrame({'a':[1,2,3,4,5,6,7]*1000000,'b':[1,2,3,4,5,6,7]*1000000,'c':[1,2,3,4,5,6,7]*1000000})

    print("Saving DataFrame...")
    if not os.path.exists(os.path.join("datasets", username)):
        os.makedirs(os.path.join("datasets", username), exist_ok=True)
    df.to_csv(os.path.join("datasets", username, filename),index=False)

    payload = {
        "filename": filename, 
        "username": username,
    }

    print("Sending request...")
    response = None
    try:
        response = requests.post(url, json=payload)
        print(response.json()['profile_path'])
        """
        ## profile_path is then used to load and display in streamlit
        loaded_report = open(response.json()['profile_path'], 'r', encoding='utf-8')
        source_code = loaded_report.read() 
        components.html(source_code, height = 1500, scrolling=True)
        """
    except Exception as e:
        # response is still None if the request itself failed
        body = response.text if response is not None else ""
        print(f"Response not expected: {e} {body}")
if __name__ == "__main__":
    main()

There is 1 answer

Navi On

The ValueError('Argument must be an image, collection, or ContourSet in this Axes') you're seeing is related to the fact that Matplotlib is not thread-safe: its pyplot interface keeps global state, so plotting from several threads at once can make one thread operate on another thread's figure or axes.

One possible mitigation is to select Matplotlib's non-interactive Agg backend before any plotting happens, so that no GUI toolkit is involved when the report is rendered in a worker thread. You can try updating your async_to_file function to use this approach:

import asyncio

import matplotlib

# Select the non-interactive Agg backend as early as possible,
# before any report is rendered
matplotlib.use("Agg")

async def async_to_file(profile, profile_path):
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, save_report, profile, profile_path)

def save_report(profile, profile_path):
    # to_file() renders the report and writes it to disk; it returns None,
    # so return the path for the caller to use
    profile.to_file(profile_path)
    return profile_path

This code asks Matplotlib to use the Agg backend before the ydata-profiling report is rendered. Note that to_file both renders the report and writes it to disk, so no separate rendering call is needed. Switching the backend may not be sufficient on its own, because pyplot's global state is still shared between the worker threads.

Make sure to test this modification in your environment to ensure it resolves the threading issue. If you still encounter problems, consider checking if there are updates or specific configurations in ydata-profiling that can help mitigate these threading issues.