I am training the coarse-to-fine coreference model from AllenNLP (for a language other than English) with the template config bert_lstm.jsonnet. When I replace the type "lstm" of the context layer with "gru", training works, but the change seems to have very little impact: the same 63 GB of RAM are consumed each epoch, and the validation F1 score hovers around the same value. Does this change in the config actually replace the Bi-LSTM layer with a Bi-GRU layer, or am I missing something?
"context_layer": {
"type": "gru",
"bidirectional": true,
"hidden_size": gru_dim,
"input_size": bert_dim,
"num_layers": 1
},
It would take some experimentation to be sure, but my assumption is that essentially all of the work happens inside BERT (your embedder), and the context_layer does very little regardless of whether it is a GRU or an LSTM. If you take a look at the similar SpanBERT config, the context layer there is actually just a pass-through. The same goes for memory: most of it is consumed by BERT, and the context layer contributes very little.
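For reference, a pass-through context layer would look roughly like this (a sketch along the lines of the SpanBERT-style config; the exact file may differ, and input_dim has to match the embedder's output size, i.e. your bert_dim):

"context_layer": {
    "type": "pass_through",
    "input_dim": bert_dim
},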
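If you want to confirm that the "gru" type really does instantiate a bidirectional torch.nn.GRU, you can build the context layer in isolation. A minimal sketch, assuming a recent AllenNLP version (the 768/200 values are just placeholders for your bert_dim/gru_dim, and import paths may vary slightly between releases):

from allennlp.common import Params
from allennlp.modules import Seq2SeqEncoder

# Build the context layer exactly as the config block would (placeholder dims).
context_layer = Seq2SeqEncoder.from_params(Params({
    "type": "gru",
    "bidirectional": True,
    "hidden_size": 200,   # gru_dim
    "input_size": 768,    # bert_dim
    "num_layers": 1,
}))

print(context_layer)                   # should show a wrapped torch.nn.GRU(..., bidirectional=True)
print(context_layer.get_output_dim())  # 400, i.e. 2 * hidden_size for a bidirectional encoder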