I have some problems running an Apache Beam job on Dataflow. The code runs fine on a small dataset, but when running a bigger batch job on Dataflow I get the following message:
RuntimeError: InvalidArgumentError: indices[10925] = [889,43] is out of bounds: need 0 <= index < [1238,43] [[{{node transform/transform/SparseToDense}} = SparseToDense[T=DT_INT64, Tindices=DT_INT64, validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](transform/transform/StringSplit, transform/transform/SparseToDense/output_shape, transform/transform/compute_and_apply_vocabulary/apply_vocab/hash_table_Lookup, transform/transform/compute_and_apply_vocabulary/apply_vocab/string_to_index/hash_table/Const)]]
Taken from the log in Dataflow.
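If I read the error right, indices[10925] = [889,43] means that row 889 of the batch has a token at column index 43, i.e. a tweet with more than 43 tokens, while the dense shape only allows column indices 0 to 42. A minimal sketch with made-up values that reproduces the same error:

import tensorflow as tf

# One value sits at column index 43, but the requested dense shape has
# only 43 columns (valid indices 0..42), so SparseToDense fails the
# bounds check, just like in the Dataflow log above.
dense = tf.sparse_to_dense(sparse_indices=[[0, 43]], output_shape=[1, 43],
                           sparse_values=[7], validate_indices=True)

with tf.Session() as sess:
    sess.run(dense)  # raises InvalidArgumentError: ... is out of bounds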
I have figured out that it is connected to the preprocessing_fn I pass to AnalyzeAndTransformDataset:
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    words = tf.string_split(inputs['tweet'], DELIMITERS)
    int_representation = tft.compute_and_apply_vocabulary(
        words, default_value=0, top_k=10000)
    # The shape here is the problem, I think
    int_representation = tft.sparse_tensor_to_dense_with_shape(
        int_representation, [None, 43])
    outputs = inputs
    outputs["int_representation"] = int_representation
    return outputs
The goal is to get a dense output vector of length 43 for each example. This will be used for sentiment analysis on Twitter data as a home project :)
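One idea I have not verified at scale: clip the sparse tensor to at most 43 columns before densifying, so no column index can fall outside the fixed shape. A sketch (tf.sparse_slice is the TF 1.x name; the cap of 43 matches the shape above):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    words = tf.string_split(inputs['tweet'], DELIMITERS)
    int_representation = tft.compute_and_apply_vocabulary(
        words, default_value=0, top_k=10000)
    # Drop any tokens past column 42 so every index fits in [None, 43].
    clipped = tf.sparse_slice(
        int_representation, start=[0, 0],
        size=[int_representation.dense_shape[0], 43])
    outputs = inputs
    outputs["int_representation"] = tft.sparse_tensor_to_dense_with_shape(
        clipped, [None, 43])
    return outputs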
Thanks for any help!
UPDATE:
I didn't manage to solve the problem, but instead I converted the vector to a dense vector inside my TensorFlow Estimator model. Not the solution I had hoped for, but it works! I have also reported the issue to TensorFlow Transform on GitHub.
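Roughly, the workaround looks like this (a sketch with illustrative names; the transformed feature stays a SparseTensor and is densified inside the model's input handling):

import tensorflow as tf

MAX_LEN = 43  # the fixed length my model expects

def densify(features):
    # 'int_representation' arrives here as a tf.SparseTensor instead of
    # being densified in preprocessing_fn.
    sparse_ids = features['int_representation']
    dense_ids = tf.sparse_tensor_to_dense(sparse_ids, default_value=0)
    # Truncate tweets longer than MAX_LEN, then pad shorter ones with zeros.
    dense_ids = dense_ids[:, :MAX_LEN]
    pad = MAX_LEN - tf.shape(dense_ids)[1]
    dense_ids = tf.pad(dense_ids, [[0, 0], [0, pad]])
    dense_ids.set_shape([None, MAX_LEN])
    features['int_representation'] = dense_ids
    return features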