I have a dataset consisting of 100000 samples.
I need to split this dataset into 100 subsets and train an ML model on each subset. Since the trained models are independent, this part is easy to parallelize with something like:
from dask import compute, delayed
from sklearn.base import clone
from sklearn.linear_model import Lasso

X, y = load_data()
n_windows = 100
window = len(X) // n_windows  # samples per subset

model = Lasso()
results = []
for i in range(0, len(X), window):
    # fit an independent copy of the estimator on each subset
    results.append(delayed(clone(model).fit)(X[i:i + window], y[i:i + window]))
results = compute(*results)
But say the model itself needs to spawn processes, for example if the model is a pipeline that contains a cross-validation step such as GridSearchCV or HyperbandSearchCV.
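To make this concrete, here is a rough sketch of the nested case I mean (load_data and the alpha grid are just placeholders):

from dask import compute, delayed
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = load_data()
window = len(X) // 100  # samples per subset

results = []
for i in range(0, len(X), window):
    # the inner search wants to parallelize over the parameter grid on its own
    search = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0]}, n_jobs=-1)
    results.append(delayed(search.fit)(X[i:i + window], y[i:i + window]))
results = compute(*results)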
How does it work then?
How should I parallelize this code?
It's not clear to me how to make it work, especially if I use sklearn estimators like GridSearchCV or ColumnTransformer, which use joblib instead of dask to parallelize computations.
In fact, depending on whether I use a Client or not, like so:
from dask.distributed import Client
client = Client()
and depending on whether this instantiated client is created in the main script or imported from a different module, I get either a warning or an error.
In the first case the code is successfully executed but I get a warning saying:
Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
In the second case the code never finishes, the interpreter gets stuck, and I get this error:
daemonic processes are not allowed to have children
Any help on how to tackle this problem would be much appreciated. Thank you
Take a look at Dask ML, it has much of what you need.
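For example, dask_ml.model_selection.GridSearchCV uses Dask rather than joblib for the inner search, so it composes with a Dask cluster; alternatively, a plain scikit-learn GridSearchCV can be pointed at the Dask workers through joblib's dask backend. A rough sketch of both options, assuming a local Client and the same placeholder load_data and alpha grid as in the question:

from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV as DaskGridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
import joblib

client = Client()
X, y = load_data()  # placeholder
param_grid = {"alpha": [0.01, 0.1, 1.0]}  # placeholder grid

# Option 1: Dask-ML's drop-in GridSearchCV schedules the search on the Dask cluster
dask_search = DaskGridSearchCV(Lasso(), param_grid)
dask_search.fit(X, y)

# Option 2: keep scikit-learn's GridSearchCV but route its joblib calls to Dask,
# which avoids nested multiprocessing inside the workers
sk_search = GridSearchCV(Lasso(), param_grid, n_jobs=-1)
with joblib.parallel_backend("dask"):
    sk_search.fit(X, y)

Either way the parallelism ends up managed by the Dask scheduler, so the outer loop over subsets and the inner search no longer fight over multiprocessing workers.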