I'm trying to retrain Inception v3 on my Raspberry Pi 3, and I'm getting this NaN histogram error message.
python /home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py --bottleneck_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/bottlenecks --how_many_training_steps 500 --model_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/inception --output_graph=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_graph.pb --output_labels=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_labels.txt --image_dir /home/pi/Documents/Machine\ Learning/Inception/Retraining_Images
Looking for images in 'Granny Smith Apple'
Looking for images in 'Red Delicious'
100 bottleneck files created.
200 bottleneck files created.
2017-01-07 11:30:22.180768: Step 0: Train accuracy = 56.0%
2017-01-07 11:30:22.242166: Step 0: Cross entropy = nan
2017-01-07 11:30:22.850969: Step 0: Validation accuracy = 50.0%
Traceback (most recent call last):
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 887, in main
ground_truth_input: train_ground_truth})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]
Caused by op u'HistogramSummary', defined at:
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 846, in main
bottleneck_tensor)
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 764, in add_final_training_ops
tf.histogram_summary(final_tensor_name + '/activations', final_tensor)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 100, in histogram_summary
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 100, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]
I tried changing merged = tf.merge_all_summaries() in retrain.py after reading this, but it didn't work.
Also, the first time I tried to retrain, I got different results for step 0 before hitting an error:
2017-01-07 11:13:36.548913: Step 0: Train accuracy = 89.0%
2017-01-07 11:13:36.555770: Step 0: Cross entropy = 0.590778
2017-01-07 11:13:37.052190: Step 0: Validation accuracy = 76.0%
It sounds like it would help to know where the NaN values are coming from. For that, take a look at the TensorFlow debugger (tfdbg): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/debugger/index.md
In your retrain.py, you can make a change like
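A minimal sketch of such a change, assuming your TensorFlow build ships the tfdbg module described in the guide linked above. The helper name wrap_with_tfdbg and the variable name sess are placeholders for illustration, not names taken from retrain.py:

```python
def wrap_with_tfdbg(sess):
    """Wrap an existing session so that each run() call opens the tfdbg CLI."""
    # Imported inside the function so this file still loads on machines
    # where tfdbg is not available.
    from tensorflow.python import debug as tf_debug

    debug_sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    # Register the built-in filter that flags tensors containing Inf or NaN.
    debug_sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
    return debug_sess


# Assumed usage in retrain.py, right after the session is created:
# sess = wrap_with_tfdbg(sess)
```

The wrapped session is meant to be used in place of the original one, so the existing sess.run() calls for training and evaluation should not need to change.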
Then when the sess.run() happens for the training and evaluation, you will drop into the command-line interface of the debugger. At the tfdbg> prompt, you can enter a command to let the code run until any NaNs or Infinities appear in the TensorFlow graph.
When the tensor filter has_inf_or_nan is hit, the interface will give you a list of Tensors containing Infs or NaNs, sorted in time order. The one on the top should be the "culprit", i.e., the one that first generated the bad numerical values. Say its name is node_1; you can then use tfdbg commands to look at its inputs and node attributes.
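For reference, the tfdbg guide uses run -f has_inf_or_nan to run until the filter fires, and li -r node_1 and ni -a node_1 to list a node's inputs (recursively) and its attributes. Type help at the prompt to confirm these against your installed version. The filter itself only flags tensors that contain at least one NaN or Inf. A rough pure-Python sketch of that check follows; the function name and the sample values are made up for illustration, and the real filter operates on the dumped tensor arrays rather than Python lists:

```python
import math


def contains_inf_or_nan(values):
    """Return True if any float in `values` is NaN or +/-Inf."""
    return any(math.isnan(v) or math.isinf(v) for v in values)


# Hypothetical values standing in for the contents of a dumped tensor.
healthy = [0.59, 0.12, 0.29]
broken = [0.59, float("nan"), 0.29]

print(contains_inf_or_nan(healthy))  # False
print(contains_inf_or_nan(broken))   # True
```

The first tensor in time order for which this kind of check is true is the one worth inspecting, since every later NaN may simply have propagated from it.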