I’m trying to perform a slightly complex hyperparameter tuning operation in Databricks on a TensorFlow model (the complexity comes from how many different tools we’re trying to make work together, not from anything about the model training itself). The problem arises between hyperopt, the library we’re using, and the dataset, which is a TFRecordDataset. Specifically, if we want to use SparkTrials to distribute the separate tuning runs across the worker nodes in the cluster, hyperopt needs to pickle the model, the dataset, the hyperparameters, and anything else defined in the closure of the objective function. That all gets pickled on the driver node, unpickled on the worker node, and then the training runs, and so on. Everything pickles just fine except the dataset, because TensorFlow datasets aren’t picklable, at least as far as I know. The error is always along the lines of not being able to convert a tensor of dtype variant to a numpy array.
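To make the setup concrete, here’s roughly the shape of the code I’m working with. The paths, the feature spec, and the toy model below are placeholders rather than the real pipeline:

```python
import tensorflow as tf
from hyperopt import fmin, hp, tpe, STATUS_OK, SparkTrials

# Placeholder path and feature spec -- not the real pipeline.
TFRECORD_PATHS = ["/dbfs/data/images/part-00000.tfrecord"]

def _parse(example_proto):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.float32),
    }
    parsed = tf.io.parse_single_example(example_proto, features)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed["label"]

# Built on the driver, like everything else in the notebook.
dataset = (tf.data.TFRecordDataset(TFRECORD_PATHS)
           .map(_parse)
           .batch(32))

def objective(params):
    # `dataset` is captured from the enclosing scope, so SparkTrials has to
    # serialize it along with the rest of the closure -- and that's where the
    # "dtype variant to numpy array" error comes from.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(int(params["units"]), activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    history = model.fit(dataset, epochs=2, verbose=0)
    return {"loss": history.history["loss"][-1], "status": STATUS_OK}

best = fmin(
    fn=objective,
    space={"units": hp.quniform("units", 32, 256, 32)},
    algo=tpe.suggest,
    max_evals=16,
    trials=SparkTrials(parallelism=4),
)
```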
I guess what I’m trying to figure out is whether there’s any sort of workaround I can rig up. The dataset is about 8,000 images, each with a few additional corresponding data fields plus the target variable, and the solution needs to scale, because that count could grow.
Here are a few of the things I’ve been considering. I’ve ventured a bit down each of these paths without much success (and/or only to find that each solution would likely be fairly involved), so I wanted to get advice before diving in head first on something that might not pan out.
Option 1: Convert everything to numpy arrays. I’m not sure how I’d even do this, since numpy arrays live in memory and a dataset this size wouldn’t necessarily fit (and even if this one did, that wouldn’t guarantee scalability).
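The only version of this I can picture is materializing the whole parsed dataset on the driver, something like the hypothetical helper below (it assumes an unbatched (image, label) pipeline, i.e. before the `.batch` call in the sketch above), which is exactly where the memory worry comes from:

```python
import numpy as np
import tensorflow as tf

def dataset_to_numpy(dataset):
    """Hypothetical helper: pull an unbatched (image, label) tf.data pipeline
    into plain numpy arrays on the driver. At ~8,000 images of 224x224x3
    float32 that's already on the order of 5 GB, and it only grows from
    there, which is exactly the scaling concern."""
    images, labels = [], []
    for image, label in dataset.as_numpy_iterator():
        images.append(image)
        labels.append(label)
    return np.stack(images), np.asarray(labels)
```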
Option 2: Modify the fmin function and/or the relevant hyperopt classes so that unpicklable items in the objective function’s closure get skipped, and then instantiate them on the worker node at the point where they would otherwise be unpickled. I have no idea whether this is even feasible, but if it is, I suspect it’s a few notches above my pay grade.
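To spell out the instantiate-on-the-worker half of that idea: rather than patching hyperopt internals, I’m wondering whether the same effect could come from the pickle protocol itself, i.e. a wrapper that only ever pickles plain Python data and rebuilds the dataset lazily. A completely untested sketch, with made-up names and paths:

```python
import tensorflow as tf

class PicklableTFRecordDataset:
    """Hypothetical wrapper (untested): pickle only the file paths and
    pipeline settings, and rebuild the actual tf.data.Dataset lazily on
    whichever node ends up using it."""

    def __init__(self, tfrecord_paths, batch_size=32):
        self.tfrecord_paths = list(tfrecord_paths)
        self.batch_size = batch_size
        self._dataset = None  # never crosses the pickle boundary

    def get(self):
        if self._dataset is None:
            self._dataset = (
                tf.data.TFRecordDataset(self.tfrecord_paths)
                # .map(_parse) would go here, as in the earlier sketch
                .batch(self.batch_size)
            )
        return self._dataset

    def __getstate__(self):
        # Drop the live (unpicklable) dataset; keep only plain Python data.
        state = self.__dict__.copy()
        state["_dataset"] = None
        return state


wrapped = PicklableTFRecordDataset(["/dbfs/data/images/part-00000.tfrecord"])
```

The objective function would then close over `wrapped` and call `wrapped.get()` once it’s actually running on the worker, so only the file paths and batch size ever cross the pickle boundary. I have no idea whether SparkTrials trips over this somewhere else, though.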
Option 3: Abandon distributed hyperparameter tuning in favor of merely distributed model training with serial hyperparameter runs. I’ve actually already tried this using TensorFlow’s MirroredStrategy, and it gives the same error about variant tensors not converting to numpy arrays. There’s also Horovod, which I believe is built into the Databricks ML runtime and which I haven’t tried yet; I’m not especially hopeful about it, though, since I don’t see how it would be any different. I’ve also tried broadcasting the dataset in the Spark context, and that gives the same error, so I suspect this dataset really doesn’t want to be used for distributed anything.
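For reference, this is roughly the shape of the MirroredStrategy attempt (toy model standing in for the real one); the commented-out lines mark where my attempts fall over with the variant-tensor error:

```python
import tensorflow as tf

# Toy model under MirroredStrategy -- roughly the shape of what I tried.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset, epochs=5)          # same tf.data pipeline as above
# broadcast_ds = sc.broadcast(dataset)  # the Spark-broadcast attempt, which
#                                       # fails with the same error
```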
Thoughts? Considerations? Things I haven’t tried? Big misconceptions I seem to have?