Transforming TensorFlow datasets to Beam datasets


There are a variety of ways to get a dataset you can train on in TensorFlow. One of the things TensorFlow Transform does is provide preprocessing via AnalyzeAndTransformDataset and TransformDataset. Surprisingly, the dataset being referred to is not a TensorFlow dataset, but rather a dataset in the Apache Beam sense. That is understandable to some degree, given that the function is tft_beam.AnalyzeAndTransformDataset.

The heart of my question is this: given that the metadata is already known by TensorFlow, why aren't there easier ways to get from a TensorFlow dataset to a Beam dataset? I understand that a TensorFlow dataset will generally repeat itself forever, but is there a way to transform a TensorFlow dataset into a dataset that can be processed by Beam? Or is the only solution to create the Beam dataset by pointing to the original data on disk? Does this have to do with the unboundedness of a TensorFlow dataset, or is there some other reason that a TensorFlow dataset cannot be analyzed/transformed through appropriate transformations so that the conversion is abstracted away from the developer? All of the examples I have seen started with dictionaries, and there is another Stack Overflow question here that talks about this to some extent, but it doesn't fully explain why this is the way it is.


There is 1 answer

chamikara On

This seems to be a question for the TensorFlow team rather than Apache Beam, but the TFX transforms you referred to are built on top of Beam transforms (so Beam is used as a utility). You are not directly working with Beam constructs (PCollections, PTransforms, etc.). If you want to build a Beam pipeline using the intermediate data, you might need to start with TFRecord files and use Beam's tfrecordio source, as the other post mentioned.