Nested processes with Dask and machine learning models


I have a dataset consisting of 100000 samples.

I need to split this dataset into 100 subsets and train an ML model on each subset. Since the trained models are independent, it's easy to parallelize this part by doing something like:

from dask import compute, delayed
from sklearn.base import clone
from sklearn.linear_model import Lasso

X, y = load_data()
n_windows = 100
window = len(X) // n_windows  # 100 subsets -> 1000 samples per window

model = Lasso()

results = []
for i in range(0, len(X), window):
    # clone so each window gets its own independent estimator,
    # fitted on that window's slice only
    results.append(delayed(clone(model).fit)(X[i:i + window], y[i:i + window]))

results = compute(*results)

But say the model itself needs to spawn processes, for example if it is a pipeline that contains a cross-validated search like GridSearchCV or HyperbandSearchCV.

How does that work then? How should I parallelize this code? It's not clear to me how to make it work, especially since sklearn estimators like GridSearchCV or ColumnTransformer use joblib rather than dask to parallelize their computations.
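For concreteness, here is a sketch of the nested setup I mean (fit_window and the alpha grid are just for illustration):

from sklearn.model_selection import GridSearchCV

def fit_window(X_win, y_win):
    # the inner search wants to spawn its own joblib workers (n_jobs=-1)
    # while the outer loop is already parallelized by dask
    search = GridSearchCV(Lasso(), {"alpha": [0.1, 1.0, 10.0]}, n_jobs=-1)
    return search.fit(X_win, y_win)

results = compute(*[delayed(fit_window)(X[i:i + window], y[i:i + window])
                    for i in range(0, len(X), window)])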

In fact, depending on whether or not I instantiate a Client, like so:

from dask.distributed import Client
client = Client()

and depending on whether this client is created in the main script or imported from a different module, I get either a warning or an error.

In the first case the code executes successfully, but I get a warning saying:

Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1

In the second case the code never finishes, the interpreter gets stuck, and I get this error:

daemonic processes are not allowed to have children
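As far as I can tell, pinning the inner search to n_jobs=1 (which is what joblib falls back to anyway, per the warning above) sidesteps both issues, but then each search runs serially inside its task. A sketch of that variant:

from sklearn.model_selection import GridSearchCV

def fit_window(X_win, y_win):
    # a single-process inner search never tries to fork children
    # inside a daemonic dask worker, at the cost of a serial search
    search = GridSearchCV(Lasso(), {"alpha": [0.1, 1.0, 10.0]}, n_jobs=1)
    return search.fit(X_win, y_win)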

Any help on how to tackle this problem would be much appreciated. Thank you.


1 Answer

Brian Larsen:

Take a look at Dask-ML; it has much of what you need.
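For example, Dask-ML ships a drop-in GridSearchCV that schedules the candidate fits as Dask tasks itself, so there is no nested joblib layer to fight with. A minimal sketch (assuming dask-ml is installed, with X, y and the alpha grid as in the question):

from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

client = Client()

# Dask-ML's GridSearchCV mirrors the scikit-learn API but builds a
# Dask graph for the candidate fits instead of using joblib processes
search = GridSearchCV(Lasso(), {"alpha": [0.1, 1.0, 10.0]})
search.fit(X, y)
print(search.best_params_)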