I am trying to fine-tune a BERT model for an NER tagging task using the TensorFlow official NLP toolkit. I found there is already a BERT token classifier class (`BertTokenClassifier`) that I wanted to use. Looking at the code inside, I don't see any masking to prevent tag prediction and loss calculation on padding and [SEP] tokens. I think this prevention is possible; I just don't know how to do it. I want to prevent it both for faster training and because a blog post I read mentioned strange behaviour when these positions are not masked.
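For context, this is roughly the kind of masked loss I have in mind; a minimal, untested sketch assuming that tag ids for padding/[SEP] positions are set to a sentinel value (-100 here). The function and variable names (`masked_sparse_ce`, `labels`, `logits`, `ignore_id`) are my own, not part of the toolkit:

```python
import tensorflow as tf

def masked_sparse_ce(labels, logits, ignore_id=-100):
    """Sparse cross-entropy that skips pad/[SEP] positions labeled `ignore_id`.

    labels: int tensor [batch, seq_len]; logits: float tensor [batch, seq_len, num_tags].
    """
    mask = tf.cast(tf.not_equal(labels, ignore_id), logits.dtype)
    # Swap ignored labels for a valid class id (0); those positions get
    # zero weight from the mask, so the dummy value never affects the loss.
    safe_labels = tf.where(tf.equal(labels, ignore_id),
                           tf.zeros_like(labels), labels)
    per_token = tf.keras.losses.sparse_categorical_crossentropy(
        safe_labels, logits, from_logits=True)
    # Average only over real tokens, so padded positions contribute nothing.
    return tf.reduce_sum(per_token * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
```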
Does anybody have any idea how to wire something like this into the toolkit's token classifier?
Have you found a solution? I'm working on the same task, and I found that the padding token was dominating the predictions. Passing in an attention mask didn't change anything, so I manually truncated the sequences to 100 tokens, and that improved things.
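Roughly what I did, as a sketch (the helper and its names are my own, and I'm assuming pre-tokenized id lists with a trailing [SEP]):

```python
MAX_LEN = 100  # hard cap I settled on; tune for your data

def truncate_example(input_ids, tag_ids, sep_id, pad_tag_id=0):
    """Cut token/tag id lists to MAX_LEN, re-appending [SEP] at the end."""
    if len(input_ids) > MAX_LEN:
        input_ids = input_ids[:MAX_LEN - 1] + [sep_id]
        tag_ids = tag_ids[:MAX_LEN - 1] + [pad_tag_id]  # dummy tag for [SEP]
    return input_ids, tag_ids
```

That said, truncation only shrinks the padded region; a masked loss like the one sketched in the question above seems like the cleaner fix.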