Get attention masks from HF pipelines


How should the attention masks returned by Hugging Face's FeatureExtractionPipeline be accessed?

The code below takes an embedding model, distributes it and a Hugging Face dataset across 8 GPUs on a single node, and runs inference on the inputs. The attention masks are needed for mean pooling.

Code example:

from accelerate import Accelerator
from accelerate.utils import tqdm
from transformers import AutoTokenizer, AutoModel
from optimum.bettertransformer import BetterTransformer

import torch

from datasets import load_dataset

from transformers import pipeline

accelerator = Accelerator()

model_name = "BAAI/bge-large-en-v1.5"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModel.from_pretrained(model_name)

pipe = pipeline(
    "feature-extraction",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    truncation=True,
    padding=True,
    pad_to_max_length=True,
    batch_size=256,
    framework="pt",
    return_tensors=True,
    return_attention_mask=True,
    device=accelerator.device
)

dataset = load_dataset(
    "wikitext",
    "wikitext-2-v1",
    split="train",
)

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
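
As a quick illustration (not part of the original script), the function expects the model output's first element to be token embeddings of shape (batch, seq_len, hidden) and a mask of shape (batch, seq_len), and returns pooled embeddings of shape (batch, hidden):

# Illustrative shape check only; the tensors here are random dummies.
dummy_token_embeddings = torch.randn(2, 4, 8)      # (batch, seq_len, hidden)
dummy_attention_mask = torch.tensor([[1, 1, 1, 0],
                                     [1, 1, 0, 0]]) # (batch, seq_len)
pooled = mean_pooling((dummy_token_embeddings,), dummy_attention_mask)
print(pooled.shape)  # torch.Size([2, 8])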


# Assume 8 processes

with accelerator.split_between_processes(dataset["text"]) as data:

    for out in pipe(data):

        # This is what I want, but the pipeline output does not expose "attention_mask"
        sentence_embeddings = mean_pooling(out, out["attention_mask"])

I need the attention masks from pipe to use for mean pooling.

Best,

Enrico

1 Answer

druskacik:

The pipeline object from the transformers library provides a convenient abstraction for quick model inference, but for more customized solutions it is usually better to use the model directly. For example:

text = 'This is a test.'

tokenized = tokenizer(
    text,
    max_length=512,
    truncation=True,
    padding=True,
    return_attention_mask=True,
    return_tensors='pt').to(accelerator.device)

out = model(**tokenized)

embeddings = out.last_hidden_state
attention_mask = tokenized['attention_mask']

You can then use the embeddings and attention_mask to compute the mean pooling. You may also consider using out.pooler_output instead of computing the mean pooling manually; however, I am not sure how pooler_output is computed for this model, so be wary.
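
Putting the two pieces together, here is a minimal sketch (not the answerer's code) of how the direct-model approach could replace the pipeline in the original script. It reuses the model, tokenizer, accelerator, dataset, and mean_pooling defined above; the batch size of 256 mirrors the pipeline call, and the final L2 normalization is an extra assumption based on how BGE embeddings are typically used.

# Sketch only: direct-model inference with mean pooling, reusing the objects
# defined earlier in this thread (model, tokenizer, accelerator, dataset,
# mean_pooling). Batch size and L2 normalization are illustrative choices.
model = model.to(accelerator.device)
model.eval()

batch_size = 256
all_embeddings = []

with accelerator.split_between_processes(dataset["text"]) as data:
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        tokenized = tokenizer(
            batch,
            max_length=512,
            truncation=True,
            padding=True,
            return_attention_mask=True,
            return_tensors="pt",
        ).to(accelerator.device)

        with torch.no_grad():
            out = model(**tokenized)

        # The attention mask comes straight from the tokenizer output,
        # which is exactly what the pipeline was hiding.
        sentence_embeddings = mean_pooling(out, tokenized["attention_mask"])

        # Optional: BGE embeddings are usually L2-normalized before use.
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
        all_embeddings.append(sentence_embeddings.cpu())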