How to concatenate + tokenize + pad strings in TFX preprocessing?


I'd like to perform the usual text preprocessing steps in the Transform step/component of a TensorFlow Extended pipeline. My data looks like this (strings in the feature columns, 0/1 integers in the label column):

field1 field2 field3 label
--------------------------
aa     bb     cc     0
ab     gfdg   ssdg   1 

import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow_text import UnicodeCharTokenizer

def preprocessing_fn(inputs):
    
    outputs = {}
    outputs['features_xf'] = tf.sparse.concat(axis=0, sp_inputs=[inputs["field1"], inputs["field2"], inputs["field3"]])
    outputs['label_xf'] = tf.convert_to_tensor(inputs["label"], dtype=tf.float32)

    return outputs

but this doesn't work:

ValueError: Arrays were not all the same length: 3 vs 1 [while running 'Transform[TransformIndex0]/ConvertToRecordBatch']
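One plausible cause of the mismatch: `tf.sparse.concat(axis=0)` stacks the three features along the *batch* dimension, so the combined feature ends up with three times as many rows as the label column. Joining the strings per example keeps one row per label. A minimal sketch, assuming the fields arrive as dense string tensors of shape `[batch]` (in a real `preprocessing_fn` they may be `SparseTensor`s and need densifying first):

```python
import tensorflow as tf

# Toy batch mirroring the question's data; in a real Transform step
# these would come from `inputs` and may be SparseTensors.
field1 = tf.constant(["aa", "ab"])
field2 = tf.constant(["bb", "gfdg"])
field3 = tf.constant(["cc", "ssdg"])

# Join per example (along the feature axis) instead of stacking
# along the batch axis, so the batch size stays aligned with the label.
joined = tf.strings.join([field1, field2, field3], separator=" ")
print(joined.numpy())  # [b'aa bb cc', b'ab gfdg ssdg']
```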

(Later on I want to apply char-level tokenization and padding to MAX_LEN as well.)
Any ideas?
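For the char-level tokenization and padding mentioned above, one hedged sketch is to decode each joined string into Unicode code points (a `RaggedTensor` of integer ids, one per character) and pad/truncate to a fixed length; `MAX_LEN = 8` here is a hypothetical value:

```python
import tensorflow as tf

MAX_LEN = 8  # hypothetical padding length

# Stand-in for the joined per-example strings from the Transform step.
joined = tf.constant(["aa bb cc", "ab gfdg ssdg"])

# Char-level "tokenization": each character becomes its Unicode
# code point, yielding a RaggedTensor of integer ids.
codepoints = tf.strings.unicode_decode(joined, "UTF-8")

# Pad short rows (and truncate long ones) to MAX_LEN, using 0 as pad id.
padded = codepoints.to_tensor(default_value=0, shape=[None, MAX_LEN])
print(padded.shape)  # (2, 8)
```

Inside an actual `preprocessing_fn` you might instead use `tft.compute_and_apply_vocabulary` or `tensorflow_text`'s `UnicodeCharTokenizer` (as imported in the question); this is only one way to get fixed-length integer sequences.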
