Problem with TensorFlow Transform (TFX) compute_and_apply_vocabulary/sparse_tensor_to_dense_with_shape


I have a problem running an Apache Beam job on Dataflow. The code runs fine on a small dataset, but when running a bigger batch job on Dataflow I get the following message:

RuntimeError: InvalidArgumentError: indices[10925] = [889,43] is out of bounds: need 0 <= index < [1238,43] [[{{node transform/transform/SparseToDense}} = SparseToDense[T=DT_INT64, Tindices=DT_INT64, validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](transform/transform/StringSplit, transform/transform/SparseToDense/output_shape, transform/transform/compute_and_apply_vocabulary/apply_vocab/hash_table_Lookup, transform/transform/compute_and_apply_vocabulary/apply_vocab/string_to_index/hash_table/Const)]]

Taken from the log in Dataflow.
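
Reading the message literally: the index [889,43] means row 889, column 43, while the dense shape [1238,43] only allows columns 0 through 42. That suggests at least one tweet in the larger batch splits into more than 43 tokens. Below is a minimal plain-Python sketch of that bounds check. The helper name `sparse_to_dense` is hypothetical and only mimics the validation; it is not TensorFlow's implementation.

```python
def sparse_to_dense(indices, values, shape, default=0):
    """Scatter (row, col) -> value pairs into a dense list of lists.

    Every index must satisfy 0 <= index < shape in each dimension,
    which is the condition the SparseToDense error reports as violated.
    """
    rows, cols = shape
    dense = [[default] * cols for _ in range(rows)]
    for (r, c), v in zip(indices, values):
        if not (0 <= r < rows and 0 <= c < cols):
            raise IndexError(
                f"index [{r},{c}] is out of bounds: "
                f"need 0 <= index < [{rows},{cols}]"
            )
        dense[r][c] = v
    return dense


# A 2-row batch densified to width 3: the second row has a 4th token (column 3).
indices = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (1, 3)]
values = [5, 7, 2, 9, 4, 8]

try:
    sparse_to_dense(indices, values, shape=(2, 3))
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

On a small dataset every row may happen to fit within the fixed width, which would explain why the failure only shows up on the bigger batch.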

I have figured out that it is connected to the preprocessing_fn I pass to AnalyzeAndTransformDataset:

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
      # Split each tweet into tokens; the result is a SparseTensor
      words = tf.string_split(inputs['tweet'], DELIMITERS)
      int_representation = tft.compute_and_apply_vocabulary(
          words, default_value=0, top_k=10000)
      # The shape out here is the problem, I think
      int_representation = tft.sparse_tensor_to_dense_with_shape(
          int_representation, [None, 43])
      outputs = inputs
      outputs["int_representation"] = int_representation
      return outputs

The goal is to get a dense output vector of length 43 for each example. This will be used for sentiment analysis on Twitter data as a home project :)
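
One way to get a fixed width of 43 is to drop every sparse entry whose column index is 43 or higher before densifying, so that short tweets are padded and long tweets are truncated. The sketch below shows the idea in plain Python. The names `MAX_LEN`, `truncate_columns` and `to_fixed_width` are hypothetical. Inside a TensorFlow graph, `tf.sparse.slice` might serve the same purpose, but that is an untested assumption here.

```python
MAX_LEN = 43  # target width taken from the question


def truncate_columns(indices, values, max_len=MAX_LEN):
    """Keep only the sparse entries whose column index fits in max_len."""
    kept = [((r, c), v) for (r, c), v in zip(indices, values) if c < max_len]
    return [idx for idx, _ in kept], [v for _, v in kept]


def to_fixed_width(indices, values, num_rows, max_len=MAX_LEN, pad=0):
    """Truncate to max_len columns, then scatter into a padded dense grid."""
    indices, values = truncate_columns(indices, values, max_len)
    dense = [[pad] * max_len for _ in range(num_rows)]
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense


# Row 0 has 2 tokens; row 1 has 45 tokens (columns 0..44), two too many.
indices = [(0, 0), (0, 1)] + [(1, c) for c in range(45)]
values = [11, 12] + list(range(1, 46))

dense = to_fixed_width(indices, values, num_rows=2)
print(len(dense[0]), len(dense[1]))
```

Note that padding with 0 is indistinguishable from the `default_value=0` used for out-of-vocabulary tokens in the code above, so a different padding value may be preferable.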

Thanks for any help!

UPDATE:

I didn't manage to solve the problem, but instead converted the vector to a dense vector inside my TensorFlow Estimator model. Not the solution I had hoped for, but it works! I have also reported it to the TensorFlow Transform project on GitHub.
