I am building a GRU-based architecture. Previously, I was just padding the batches of sequences and passing them to the GRU. That obviously introduced some small error in the results, because it's not strictly correct: the GRU doesn't know to stop when it reaches the padding elements, so the hidden states of the shorter sequences keep updating over padding.
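For context, here is roughly what that padded setup looked like (a minimal sketch; the dimensions and variable names are placeholders, not my real ones):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True).cuda()

# Variable-length sequences, zero-padded to the longest one.
seqs = [torch.randn(n, 16) for n in (5, 3, 7)]
padded = pad_sequence(seqs, batch_first=True).cuda()  # (batch, max_len, 16)

# The GRU steps through every position, padding included, so the final
# hidden states of the shorter sequences are computed over zeros too.
out, h_n = gru(padded)
```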
Thus I switched the naive batch of padded 2-D sequences out for pack_padded_sequence, so that I'm not feeding extraneous padding steps to the GRU. Training time then increased by at least 3x. I am calling pack_padded_sequence on the GPU, so one thing I still need to check is whether the packing itself is just inefficient to do on the GPU.
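The packed version looks roughly like this (again a sketch with placeholder names, continuing the snippet above; one detail worth noting is that pack_padded_sequence expects the lengths as a CPU tensor or list in recent PyTorch versions, even when the data lives on the GPU):

```python
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# True sequence lengths; these stay on the CPU.
lengths = torch.tensor([s.size(0) for s in seqs])

# enforce_sorted=False saves me from sorting the batch by length myself,
# at the cost of an internal sort/unsort.
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, h_n = gru(packed)

# Unpack if the per-timestep outputs are needed downstream.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```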
Any suggestions would be appreciated!