In the Hugging Face source code for BertForSequenceClassification, pooled_output = outputs[1] is used:
    outputs = self.bert(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    pooled_output = outputs[1]
Shouldn't it be pooled_output = outputs[0]? (This answer mentioning BertPooler seems to be outdated)
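For context, here is a minimal sketch of what I see when I inspect the two outputs of the base BertModel (the model/tokenizer names are just the standard bert-base-uncased checkpoint, and the integer indexing assumes the usual tuple/ModelOutput ordering):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Hello, world!", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs[0]: last_hidden_state, one vector per token -> (batch, seq_len, hidden)
    print(outputs[0].shape)  # torch.Size([1, 6, 768])

    # outputs[1]: pooler_output, the [CLS] hidden state passed through
    # BertPooler (a Linear layer + tanh)                  -> (batch, hidden)
    print(outputs[1].shape)  # torch.Size([1, 768])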
Based on this answer, it seems that the [CLS] token learns a sentence-level representation. I am confused as to why/how masked language modelling would lead the start token to learn a sentence-level representation. (I had been assuming that BertForSequenceClassification freezes the BERT model and only trains the classification head, but maybe that's not the case.)
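To sanity-check the freezing assumption, I tried something along these lines (a rough sketch; as far as I can tell, all of the encoder's parameters have requires_grad=True out of the box, so nothing is frozen unless you freeze it yourself):

    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Count how many of the BERT encoder's parameter tensors are trainable.
    encoder_params = list(model.bert.parameters())
    trainable = sum(p.requires_grad for p in encoder_params)
    print(f"{trainable} / {len(encoder_params)} encoder tensors are trainable")

    # Training only the classification head would require freezing explicitly:
    # for p in model.bert.parameters():
    #     p.requires_grad = False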
Would a sentence embedding be equivalent to, or even better than, the [CLS] token embedding?
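By "sentence embedding" I mean something like mean-pooling the token vectors in outputs[0] over the non-padding positions, as opposed to taking outputs[1]. A rough sketch of what I have in mind (mean_pool is just my own illustration, not anything from the library):

    import torch

    def mean_pool(last_hidden_state, attention_mask):
        # Average the token embeddings, ignoring padding positions.
        mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
        summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
        counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1)
        return summed / counts

    # sentence_embedding = mean_pool(outputs[0], inputs["attention_mask"])
    # cls_embedding      = outputs[0][:, 0]  # raw [CLS] hidden state
    # pooled_embedding   = outputs[1]        # [CLS] after BertPooler (Linear + tanh)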