Good strategy for training an ML model directly using data from HDFS


I want to train a model on a compute node using data (in Parquet format) stored on an HDFS cluster, and I cannot copy the whole dataset from HDFS onto my compute node. What would be a workable solution for this (I use Python)?

I did some research, and Petastorm seems like a promising solution.
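From what I can tell, Petastorm can read a Parquet store directly, without Spark. Here is a minimal sketch of what I had in mind, assuming the HDFS URL below (hypothetical) and that the compute node has working HDFS client libraries:

```python
# Minimal sketch: stream batches from a Parquet dataset on HDFS with
# Petastorm, without Spark. The URL is a hypothetical placeholder.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

with make_batch_reader("hdfs://namenode:8020/datasets/train") as reader:
    # The DataLoader streams record batches lazily; it never loads
    # the whole dataset into memory on this node.
    for batch in DataLoader(reader, batch_size=64):
        pass  # feed `batch` to the training step here
```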

However, I came across another post saying, quote:

The recommended workflow is:

1. Use Apache Spark to load and optionally preprocess data.
2. Use the Petastorm spark_dataset_converter method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.
3. Feed data into a DL framework for training or inference.
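As far as I understand, that workflow would look roughly like this sketch (the path, cache directory, and batch size are my own hypothetical choices):

```python
# Sketch of the quoted Spark + spark_dataset_converter workflow,
# assuming Spark can reach the HDFS cluster.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = (SparkSession.builder
         .appName("petastorm-train")
         # Cache directory Petastorm uses to materialize the DataFrame
         # (required by the converter).
         .config(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
                 "file:///tmp/petastorm_cache")
         .getOrCreate())

# 1. Load (and optionally preprocess) the Parquet data with Spark.
df = spark.read.parquet("hdfs://namenode:8020/datasets/train")

# 2. Convert the Spark DataFrame with Petastorm's converter.
converter = make_spark_converter(df)

# 3. Feed the data into a DL framework, here PyTorch.
with converter.make_torch_dataloader(batch_size=64) as loader:
    for batch in loader:
        pass  # training step goes here

converter.delete()  # remove the cached copy when done
```

Note that the converter materializes the (preprocessed) DataFrame into the cache directory, so that location needs enough space.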

I'm not sure why I need PySpark here. Does anyone know why? And if anyone has handled a similar use case, could you please share your solution? Thanks in advance!


1 Answer

Answered by OneCricketeer:

If the documentation says it can use Spark DataFrames, then yes, that implies PySpark.

(Py)Spark itself also has machine learning algorithms (Spark MLlib), however.
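For example, something like this trains directly on the HDFS-backed data without ever pulling it onto one node (the path and column names are hypothetical):

```python
# Minimal Spark MLlib sketch; assumes the DataFrame has a vector
# "features" column and a numeric "label" column.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

df = spark.read.parquet("hdfs://namenode:8020/datasets/train")

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(df)  # training runs distributed across the Spark cluster
```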

"Does anyone know why?"

Exactly what you said: you cannot load your training dataset directly onto one node, so Spark is used to read and preprocess the data in a distributed fashion before it is streamed to the training process.
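If you want to avoid Spark entirely, one alternative is to stream record batches straight from HDFS with pyarrow. A hedged sketch, assuming libhdfs and the Hadoop client libraries are configured on the compute node (host, port, and path are hypothetical):

```python
# Stream Parquet batches from HDFS lazily with pyarrow; only one
# record batch is in memory at a time, not the whole dataset.
import pyarrow.dataset as ds
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode", port=8020)
dataset = ds.dataset("/datasets/train", filesystem=hdfs, format="parquet")

for batch in dataset.to_batches(batch_size=65_536):
    pass  # convert `batch` to tensors and run a training step
```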