I'm trying to train a seq2seq model (Transformer) with both PyTorch and tensor2tensor. With tensor2tensor the batch size can be as large as 1024, while my PyTorch model throws a CUDA out-of-memory error even with a batch size of 8.
Is there any technique used in tensor2tensor to make the best use of memory?
If anyone knows, please tell me.
Thanks in advance.
In Tensor2Tensor, the batch size is by default specified as the number of tokens (subwords) per single GPU. This allows using a larger number of short sequences (sentences) in one batch, or a smaller number of long sequences. Most other toolkits instead use a fixed batch size specified as a number of sequences. Either way, it is a good idea to limit the maximum sentence length during training to a reasonable value, to prevent out-of-memory errors and excessive padding. Also note that some toolkits specify the total batch size across all GPU cards rather than per GPU.
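To illustrate the idea (this is a rough sketch of token-based batching you could use on the PyTorch side, not Tensor2Tensor's actual implementation; `max_tokens` and `max_len` are illustrative names, not T2T flags):

    import torch
    from torch.nn.utils.rnn import pad_sequence

    def batch_by_tokens(sequences, max_tokens=1024, max_len=256):
        """Group variable-length sequences (1-D LongTensors of token ids)
        into padded batches whose total size (batch * longest sequence,
        i.e. tokens including padding) stays under max_tokens."""
        # Drop overly long sequences to avoid OOM and excessive padding.
        sequences = [s for s in sequences if len(s) <= max_len]
        # Sorting by length keeps similar-length sequences together,
        # which minimizes the amount of padding inside each batch.
        sequences.sort(key=len)

        batches, current = [], []
        for seq in sequences:
            # Since sequences are sorted, seq is the longest so far;
            # check the padded size if it were added to the current batch.
            if current and len(seq) * (len(current) + 1) > max_tokens:
                batches.append(pad_sequence(current, batch_first=True))
                current = []
            current.append(seq)
        if current:
            batches.append(pad_sequence(current, batch_first=True))
        return batches

With this kind of batching, a batch of 1024 tokens might contain ~50 short sentences or only a handful of long ones, so memory use per step stays roughly constant regardless of sentence length.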