I am trying to train a simple TensorFlow model with around 9000 parameters on an EMR cluster, but when I try to train it, it throws the following error. I tried increasing the memory and decreasing the batch size, but that didn't help.

libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:504] CHECK failed: (value.size()) <= (kint32max):
Segmentation fault
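
As far as I can tell, this CHECK comes from protobuf's 2 GB (kint32max) limit on a single serialized value, which can be hit when the whole training array ends up embedded in the graph as one constant. This is not my exact pipeline, just a minimal sketch of the pattern I suspect (placeholder names and sizes):

import numpy as np
import tensorflow as tf

# Placeholder arrays -- at my real scale (millions of rows, see the second
# traceback below) the in-memory data is far larger than shown here.
features = np.random.rand(100_000, 64).astype(np.float32)
labels = np.random.randint(0, 2, size=(100_000,)).astype(np.int64)

# from_tensor_slices captures the arrays as graph constants; if such a
# constant is ever serialized to a protobuf and exceeds kint32max bytes,
# it can trigger "CHECK failed: (value.size()) <= (kint32max)".
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(512)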

Based on one of the suggestions from this, I reduced the dataset to half, but that causes another error:

Traceback (most recent call last):
  File "/home/hadoop/feed_reco/datalake2/Dropoutnet/dropoutnet_training_test.py", line 175, in <module>
    train_tensorflow(data_dir, "/home/hadoop/temp_model/", learning_rate, dropout_rate, epochs)
  File "/home/hadoop/feed_reco/datalake2/Dropoutnet/dropoutnet_training_test.py", line 165, in train_tensorflow
    model.fit(dataset, epochs=epochs)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:    indices[0] = -1 is not in [0, 6806177)
     [[{{node embedding_lookup_1}}]]
     [[StatefulPartitionedCall]]
     [[IteratorGetNext]] [Op:__inference_train_function_715]

Function call stack:
train_function -> train_function -> train_function
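
If I read the second error correctly, the embedding layer (vocabulary size 6,806,177) is being fed an id of -1. This is a quick check I could run over the dataset to confirm that; the feature name is hypothetical, whatever actually feeds embedding_lookup_1 would go there:

import tensorflow as tf

VOCAB_SIZE = 6_806_177  # range reported in the error: [0, 6806177)

def check_ids(dataset, id_key="item_id"):
    # id_key is a placeholder for the feature that feeds embedding_lookup_1.
    bad_batches = 0
    for batch in dataset:
        ids = batch[id_key] if isinstance(batch, dict) else batch[0]
        ids = tf.cast(ids, tf.int64)
        out_of_range = tf.logical_or(ids < 0, ids >= VOCAB_SIZE)
        if bool(tf.reduce_any(out_of_range)):
            bad_batches += 1
            print("out-of-range ids:",
                  tf.boolean_mask(ids, out_of_range).numpy()[:10])
    print("batches with bad ids:", bad_batches)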