I'd like to perform the usual text preprocessing steps in the Transform component of a TensorFlow Extended (TFX) pipeline. My data looks like this (strings in the independent feature columns, 0/1 integers in the label column):
field1   field2   field3   label
--------------------------------
aa       bb       cc       0
ab       gfdg     ssdg     1
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow_text import UnicodeCharTokenizer
def preprocessing_fn(inputs):
    outputs = {}
    outputs['features_xf'] = tf.sparse.concat(
        axis=0,
        sp_inputs=[inputs["field1"], inputs["field2"], inputs["field3"]])
    outputs['label_xf'] = tf.convert_to_tensor(inputs["label"], dtype=tf.float32)
    return outputs
but this doesn't work:
ValueError: Arrays were not all the same length: 3 vs 1 [while running 'Transform[TransformIndex0]/ConvertToRecordBatch']
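For context, I wire preprocessing_fn into the pipeline with the standard Transform component, roughly like this (a sketch; the upstream component names are placeholders from my setup):

from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],  # placeholder ExampleGen output
    schema=schema_gen.outputs['schema'],       # placeholder SchemaGen output
    module_file=module_file)                   # module containing preprocessing_fn above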
(Later on I want to apply char-level tokenization and padding to MAX_LEN as well; see the sketch below.)
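Something like this is the kind of tokenization I mean (a rough sketch outside of Transform; MAX_LEN and the pad value 0 are placeholders I chose):

import tensorflow as tf
import tensorflow_text as tf_text

MAX_LEN = 16  # placeholder maximum sequence length

tokenizer = tf_text.UnicodeCharTokenizer()

def tokenize_and_pad(texts):
    # tokenize() returns a RaggedTensor of Unicode code points, one row per string
    tokens = tokenizer.tokenize(texts)
    # pad (or truncate) each row to MAX_LEN, filling with 0
    return tokens.to_tensor(default_value=0, shape=[None, MAX_LEN])

print(tokenize_and_pad(tf.constant(["aa", "gfdg"])))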
Any ideas?