Second call to tf.estimator.train_and_evaluate finished after 1 step of training


I am training two TensorFlow models one after the other, each with the tf.estimator.train_and_evaluate function.

# First model: training length is bounded by max_steps
train_spec = tf.estimator.TrainSpec(input_fn=my_input_fn("train"), max_steps=max_steps)
eval_spec = tf.estimator.EvalSpec(input_fn=my_input_fn("valid"), steps=None)
tf.estimator.train_and_evaluate(model1, train_spec, eval_spec)

# Second model: max_steps is None, so a StopAtStepHook bounds the training length
hook = [tf.estimator.StopAtStepHook(num_steps=max_steps)]
train_spec = tf.estimator.TrainSpec(input_fn=my_input_fn("train"), max_steps=None, hooks=hook)
eval_spec = tf.estimator.EvalSpec(input_fn=my_input_fn("valid"), steps=None)
tf.estimator.train_and_evaluate(model2, train_spec, eval_spec)

The first trains OK but the second trains for only 1 step:

...

INFO:tensorflow:Saving dict for global step 1: LogLoss = 0.06514542, PR_AUC = 0.012231247, ROC_AUC = 0.52047175, global_step = 1, label/mean = 0.011529858, loss = 0.06514542, prediction/mean = 0.016156415

...

INFO:tensorflow:Loss for final step: 0.32117385.

I tried to run two models sequentially on the same dataset using tf.estimator.train_and_evaluate. I expected both training runs to behave similarly. However, the second run trains for only 1 step and then finishes.

Solution: I was using tf.estimator.inputs.numpy_input_fn (https://docs.w3cub.com/tensorflow~1.15/estimator/inputs/numpy_input_fn) as the input_fn of the second train_spec. When num_epochs=1, or when the argument is omitted (it defaults to 1), the input is exhausted after one pass over the data and training terminates early. Changing to num_epochs=None solved the early termination issue. With this change, max_steps=None should be set in the second training and the stopping hook (StopAtStepHook) should be used to end it. Additionally, in my case an early stopping hook should not be used in the second training. However, this forces the second training to run for its full specified length instead of stopping early.

There is 1 answer

Ergun Bicici On

Solution: I was using tf.estimator.inputs.numpy_input_fn (https://docs.w3cub.com/tensorflow~1.15/estimator/inputs/numpy_input_fn) as the input_fn of the second train_spec. When num_epochs=1, or when the argument is omitted (it defaults to 1), the input is exhausted after one pass over the data and training terminates early. The fix has three parts:

  1. Set num_epochs=None in numpy_input_fn; this solves the early termination issue.
  2. Set max_steps=None in the second training and use the stopping hook (StopAtStepHook) to end it.
  3. Do not use an early stopping hook in the second training (this applied in my case). However, this forces the second training to run for its full specified length instead of stopping early.
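
The interaction between num_epochs and the stopping hook can be illustrated without TensorFlow. The sketch below is a simplified pure-Python model, not the Estimator implementation: `batches` and `train` are hypothetical helpers, and the dataset size (100 examples) and batch size (128) are made-up numbers chosen so that one epoch is a single batch. It only shows the general rule that training ends when either the input runs out or the step limit is reached, whichever comes first.

```python
import itertools


def batches(num_examples, batch_size, num_epochs):
    """Yield batch sizes like a finite or repeating input pipeline.

    num_epochs=None repeats the data forever; an integer stops
    after that many passes over the data.
    """
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        for start in range(0, num_examples, batch_size):
            yield min(batch_size, num_examples - start)


def train(input_iter, stop_at_step):
    """Run one step per batch until the input ends or stop_at_step is hit."""
    step = 0
    for _ in input_iter:
        step += 1
        if step >= stop_at_step:  # plays the role of the stopping hook
            break
    return step


# Hypothetical sizes: 100 examples with a batch size of 128.
steps_one_epoch = train(batches(100, 128, num_epochs=1), stop_at_step=1000)
steps_repeating = train(batches(100, 128, num_epochs=None), stop_at_step=1000)
print(steps_one_epoch, steps_repeating)
```

With a finite input, the step limit is never reached because the data ends first; with a repeating input, the step limit is the only thing that ends training, which is why the second training needs the stopping hook once num_epochs=None is set.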