TensorFlow learning rate scheduler question with multiple GPUs (MirroredStrategy)


Hi, I am trying to use a learning rate scheduler, but training does not start (no iterations run). I am training on ImageNet with MirroredStrategy on 4 GPUs because the dataset is large. The dataset is already batched, for example train_ds_preprocessed = train_ds.batch(256), and in fit I pass the same batch size as the training dataset, which is 256.

import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
initial_lr = 0.01
num_epochs = 100
warmup_steps = 5

def lr_schedule(epoch):
    print(" lr_schedule if first")
    if epoch < warmup_steps:
        print("if in ")
        print("epoch : ",epoch)
        print("warmup_steps : ",warmup_steps)
        print("return : ", (epoch + 1) / warmup_steps * initial_lr)
        return (epoch + 1) / warmup_steps * initial_lr
    else:
        print("else in ")
        return initial_lr * (1.0 - (epoch - warmup_steps) / (num_epochs - warmup_steps))


learning_rate_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_schedule)



#%%
# strategy = tf.distribute.MirroredStrategy(devices = ["GPU:0", "GPU:1", "GPU:2", "GPU:3"])
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model_builder = MyModel(
        input_shape=input_image_shape,
        num_classes=num_classes,
        defined_out_channels=defined_out_channels,
    )
    model = model_builder.build()
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()
# keras.utils.plot_model(model,"001_iamgenet_mymodel.png", show_shapes = True)
history = model.fit(
    train_ds_preprocessed,
    epochs=100,
    batch_size=total_batch_size,
    validation_data=val_ds_preprocessed,
    callbacks=[checkpoint_callback, wandb_callbacks, learning_rate_scheduler],
)

The output stops here:


2023-08-26 07:18:12.707335: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:549] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
 lr_schedule if first
if in 
epoch :  0
warmup_steps :  5
return :  0.002
Epoch 1/100
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 77 all-reduces with algorithm = nccl, num_packs = 1
2023-08-26 07:18:35.281135: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301
2023-08-26 07:18:36.819182: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301
2023-08-26 07:18:37.407587: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-08-26 07:18:37.759992: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:630] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-08-26 07:18:38.781774: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301
2023-08-26 07:18:39.815530: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8301

Is there any advice you can give me?

If I do not use the learning rate scheduler and instead compile with

model.compile(optimizer = 'sgd', loss....)

then the iterations run perfectly.
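
The log shows the callback returning 0.002 for epoch 0, so the schedule function itself appears to run. As a minimal sketch for ruling it out, the same formula can be evaluated on its own, without TensorFlow and without the debug prints. The constants are copied from the code above; the list name schedule is only illustrative.

```python
# Standalone check of the warmup + linear-decay schedule (no TensorFlow needed).
# Constants mirror the question; the debug prints are dropped.
initial_lr = 0.01
num_epochs = 100
warmup_steps = 5  # counted in epochs, despite the name


def lr_schedule(epoch):
    """Return the learning rate for a zero-based epoch index."""
    if epoch < warmup_steps:
        # Linear warmup: 1/5, 2/5, ..., 5/5 of initial_lr.
        return (epoch + 1) / warmup_steps * initial_lr
    # Linear decay from initial_lr toward 0 over the remaining epochs.
    return initial_lr * (1.0 - (epoch - warmup_steps) / (num_epochs - warmup_steps))


# Evaluate every epoch that fit() would request from the callback.
schedule = [lr_schedule(epoch) for epoch in range(num_epochs)]

# Print a few landmark epochs: start, end of warmup, start and end of decay.
for epoch in (0, 4, 5, 99):
    print(epoch, schedule[epoch])
```

If these values look reasonable (warmup up to initial_lr, then a decay that stays positive through the last epoch), the schedule is probably not what blocks the first iteration.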
