I am trying to train a simple TensorFlow model with around 9,000 parameters on an EMR cluster, but when I try to train it, it throws the following error. I tried increasing the memory and decreasing the batch size, but neither helped.
```
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:504] CHECK failed: (value.size()) <= (kint32max):
Segmentation fault
```
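For context, the input pipeline is built roughly like the sketch below (names, shapes, and sizes are placeholders, not my actual script; the real feature matrices have millions of rows). As far as I understand, `tf.data.Dataset.from_tensor_slices` bakes the full arrays into the graph as constants, so once their serialized size crosses protobuf's ~2 GB limit (kint32max) it fails with exactly this CHECK:

```python
import numpy as np
import tensorflow as tf

# Simplified, placeholder version of how the dataset is built. In the real
# script the arrays are loaded from files under data_dir and are far larger.
user_features = np.random.rand(10_000, 64).astype(np.float32)
labels = np.random.randint(0, 2, size=(10_000,)).astype(np.int32)

# from_tensor_slices stores the whole arrays as constants inside the graph,
# so their serialized size counts toward protobuf's 2 GB message limit.
print("raw array bytes:", user_features.nbytes + labels.nbytes)

dataset = (
    tf.data.Dataset.from_tensor_slices((user_features, labels))
    .shuffle(10_000)
    .batch(256)
)
```

If that reading is right, it would also explain why shrinking the batch size changed nothing, since the constants are created before batching.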
Based on one of the suggestions from this, I reduced the dataset by half, but that caused a different error:
```
Traceback (most recent call last):
  File "/home/hadoop/feed_reco/datalake2/Dropoutnet/dropoutnet_training_test.py", line 175, in <module>
    train_tensorflow(data_dir, "/home/hadoop/temp_model/", learning_rate, dropout_rate, epochs)
  File "/home/hadoop/feed_reco/datalake2/Dropoutnet/dropoutnet_training_test.py", line 165, in train_tensorflow
    model.fit(dataset, epochs=epochs)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0] = -1 is not in [0, 6806177)
	 [[{{node embedding_lookup_1}}]]
	 [[StatefulPartitionedCall]]
	 [[IteratorGetNext]] [Op:__inference_train_function_715]
Function call stack:
train_function -> train_function -> train_function
```
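If I read the second error correctly, an id of -1 is reaching an embedding lookup whose vocabulary size is 6806177. This is a small check I plan to run on the raw id column before it goes into the dataset (the array below is a placeholder; in the real script the ids come from the files under data_dir):

```python
import numpy as np

VOCAB_SIZE = 6_806_177  # embedding input_dim reported in the error message

def report_out_of_range(ids: np.ndarray, vocab_size: int = VOCAB_SIZE) -> None:
    """Count ids outside [0, vocab_size), which is exactly what the
    InvalidArgumentError from embedding_lookup_1 complains about."""
    bad = (ids < 0) | (ids >= vocab_size)
    print(f"{np.count_nonzero(bad)} / {ids.size} ids out of range")
    if bad.any():
        print("offending values:", np.unique(ids[bad])[:10])

# Placeholder usage; the real id column would be loaded from data_dir instead.
report_out_of_range(np.array([0, 42, -1, 6_806_176, 6_806_177]))
```

Is the first error really the 2 GB graph limit, and what could make the indices go negative after halving the dataset?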