Inception retraining issue "Nan in summary histogram for: HistogramSummary"

3k views Asked by At

I'm trying to retrain inceptionV3 on my RPi3. I'm getting this histogram error message.

python /home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py --bottleneck_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/bottlenecks --how_many_training_steps 500 --model_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/inception --output_graph=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_graph.pb --output_labels=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_labels.txt --image_dir /home/pi/Documents/Machine\ Learning/Inception/Retraining_Images
Looking for images in 'Granny Smith Apple'
Looking for images in 'Red Delicious'
100 bottleneck files created.
200 bottleneck files created.
2017-01-07 11:30:22.180768: Step 0: Train accuracy = 56.0%
2017-01-07 11:30:22.242166: Step 0: Cross entropy = nan
2017-01-07 11:30:22.850969: Step 0: Validation accuracy = 50.0%
Traceback (most recent call last):
  File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 887, in main
    ground_truth_input: train_ground_truth})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 717, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 915, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 985, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary
     [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]

Caused by op u'HistogramSummary', defined at:
  File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 846, in main
    bottleneck_tensor)
  File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 764, in add_final_training_ops
    tf.histogram_summary(final_tensor_name + '/activations', final_tensor)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 100, in histogram_summary
    tag=tag, values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 100, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: HistogramSummary
     [[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]

I tried changing merged = tf.merge_all_summaries() in retrain.py after reading this but it didnt work.

Also, the first time I tried to retrain, I got different results for step 0 before hitting an error:

2017-01-07 11:13:36.548913: Step 0: Train accuracy = 89.0%
2017-01-07 11:13:36.555770: Step 0: Cross entropy = 0.590778
2017-01-07 11:13:37.052190: Step 0: Validation accuracy = 76.0%
2

There are 2 answers

3
Shanqing Cai On

Sounds like that it might help to know where the NaN values are coming from. For that, take a look at tensorflow debugger (tfdbg): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/debugger/index.md

In your retrain.py, you can make a change like

from tensorflow.python import debug as tf_debug

# ... 
# In def main(_)
if debug:
  sess = tf_debug.LocalCLIDebugWrapperSession(sess)
  sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

# ...

Then when the sess.run() happens for the training and evaluation, you will drop into the command-line interface of the debugger. At the tfdbg> prompt, you can enter command to let the code run until any NaNs or Infinities appear in the TensorFlow graph:

tfdbg> run -f has_inf_or_nan

When the tensor filter has_inf_or_nan is hit, the interface will give you a list of Tensors containing Infs or Nans, sorted in time order. The one on the top should be the "culprit", i.e., the one that first generated the bad numerical values. Say its name is node_1, you can use the following tfdbg commands to look at its inputs and node attributes:

tfdbg> li -r node_1
tfdbg> ni -a node_1
0
Andrew Stromme On

If you're using tf.contrib.learn you'll want to use the following:

debug_hook = tf_debug.LocalCLIDebugHook()
debug_hook.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
hooks = [debug_hook]
...
classifier.fit(..., monitors=hooks)