I am trying to replicate in TensorFlow Transform some data preprocessing that I originally did in pandas.
I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not clear to me how to reproduce the same data manipulation there. Consider the two main operations: joining dataset a and dataset b to produce c, and grouping dataset c by col1. These are straightforward in pandas, but how would I do them in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?
join datasets with tfx tensorflow transform
174 views Asked by DarioB At
There is 1 answer
You can use the Beam DataFrames API to do the join and other preprocessing exactly as you would in pandas. You can then use `to_pcollection` to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.
For top-level functions (such as `merge`), one needs to import Beam's wrapper for the pandas top-level functions (called `beam_pd` here) and use `beam_pd.func(...)` in place of `pd.func(...)`.
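As a rough sketch of the two operations from the question (join a and b into c, then group c by col1), here is the logic written in plain pandas so it can be run locally. The column names `id` and `value`, the sample data, and the helper `join_and_aggregate` are all hypothetical. The helper uses only DataFrame methods, with the join key moved to the index, on the assumption that the same calls can then be applied to Beam's deferred dataframes; that is worth checking against your Beam version.

```python
import pandas as pd


def join_and_aggregate(a, b):
    """Join a and b on the (hypothetical) 'id' column, then sum per 'col1' group."""
    # Put the join key on the index of both frames so the merge is index-to-index.
    c = a.set_index("id").merge(
        b.set_index("id"), left_index=True, right_index=True
    )
    # Group the joined dataset by col1 and aggregate the remaining columns.
    return c.groupby("col1").sum()


# Hypothetical stand-ins for the CSV files mentioned in the question.
a = pd.DataFrame({"id": [1, 2, 3], "col1": ["x", "y", "x"]})
b = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

result = join_and_aggregate(a, b)
print(result)

# In a Beam pipeline, a and b would instead be deferred dataframes, for example
# obtained with apache_beam.dataframe.convert.to_dataframe(pcollection), and the
# result would be turned back into a PCollection with to_pcollection(result).
```

Keeping the logic in a function that only calls DataFrame methods makes it easy to test against small pandas frames before running it at scale.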