I have a training dataset stored on S3 in Parquet format. I want to load this data into a notebook (on a Databricks cluster) and train a Keras model on it. There are a few ways I can think of to train a Keras model on this dataset:
- read the Parquet files from S3 in batches (maybe using Pandas/PyArrow) and feed these batches to the model (a rough sketch of this option is below, after the list)
- use the TensorFlow I/O APIs (this might require copying the Parquet files from S3 to the notebook's local environment)
- use the Petastorm package (from Uber); this might also require copying the Parquet files to the notebook's local environment (also sketched below)
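
To make option 1 concrete, this is roughly what I have in mind: a minimal sketch assuming a feature column `"features"` holding fixed-length arrays, a numeric `"label"` column, read access to the bucket from the cluster, and placeholder paths, shapes, and model.

```python
import numpy as np
import pyarrow.dataset as ds
import tensorflow as tf

PARQUET_PATH = "s3://my-bucket/training-data/"  # placeholder path
BATCH_SIZE = 1024
FEATURE_DIM = 100                               # assumed feature width

def batch_generator():
    # Scan the Parquet dataset directly from S3 in record batches,
    # without materializing the whole dataset in memory.
    dataset = ds.dataset(PARQUET_PATH, format="parquet")
    for batch in dataset.to_batches(batch_size=BATCH_SIZE):
        df = batch.to_pandas()
        x = np.stack(df["features"].to_numpy()).astype("float32")  # assumes list-valued column
        y = df["label"].to_numpy().astype("float32")
        yield x, y

# Wrap the generator so Keras can consume it as a tf.data.Dataset.
tf_dataset = tf.data.Dataset.from_generator(
    batch_generator,
    output_signature=(
        tf.TensorSpec(shape=(None, FEATURE_DIM), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(tf_dataset, epochs=1)
```
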
What is the best way to train a model in this case, so that it will be easier to scale the training to larger datasets later?
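
For reference, this is roughly the pattern I have seen for option 3 (Petastorm) on Databricks, using its Spark dataset converter. The S3 path, cache directory, column names, model, and `steps_per_epoch` are placeholders for my actual setup.

```python
import tensorflow as tf
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# `spark` is the SparkSession that Databricks notebooks provide by default.
# Petastorm materializes intermediate files under this cache directory.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

df = spark.read.parquet("s3://my-bucket/training-data/")  # placeholder path
converter = make_spark_converter(df)

with converter.make_tf_dataset(batch_size=1024) as tf_ds:
    # Petastorm yields namedtuples of the DataFrame columns; map them to (x, y).
    tf_ds = tf_ds.map(lambda batch: (batch.features, batch.label))

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # make_tf_dataset loops indefinitely by default, so steps_per_epoch is needed.
    model.fit(tf_ds, steps_per_epoch=100, epochs=1)  # placeholder steps_per_epoch
```
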