I'm currently using models like RoBERTa and CodeBERT for code author identification (you can think of it like a facial recognition task, but for code). I know these are encoder architectures.
I use these encoders to train a Siamese network on pairs of code samples with a contrastive loss: the loss pushes embeddings of non-matching pairs apart and pulls embeddings of matching pairs together. I then use the fine-tuned encoder to generate embedding vectors for unseen code and assign an author by comparing those vectors against per-author reference embeddings, built from a set of each author's samples with the same fine-tuned model. It works much like face recognition.
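For context, my training setup looks roughly like this (a simplified sketch; the checkpoint name, pooling choice, and margin value are just placeholders for what I actually use):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; I fine-tune RoBERTa/CodeBERT variants in practice
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> torch.Tensor:
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    outputs = encoder(**inputs)
    # Take the [CLS] token representation as the code embedding
    return outputs.last_hidden_state[:, 0, :]

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    # label = 1 for same-author pairs, 0 for different-author pairs
    dist = F.pairwise_distance(emb_a, emb_b)
    return torch.mean(
        label * dist.pow(2)
        + (1 - label) * torch.clamp(margin - dist, min=0).pow(2)
    )
```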
Since my project is mainly focused on embeddings, I'm not sure how a decoder could be used here, as decoders are generally used to generate an output sequence rather than to encode an input. I want to know:
- Can decoder architectures like Mistral or similar LLMs be used to generate embeddings for my task? If so, can anyone guide me on how to achieve this? I don't understand how to use a decoder for this purpose (my rough guess is sketched below).
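From the little I've read, my rough guess is something like mean-pooling the decoder's last hidden states, but I have no idea if this is the right approach (the Mistral checkpoint here is just an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example decoder-only checkpoint; purely illustrative
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16
)

def embed_with_decoder(code: str) -> torch.Tensor:
    inputs = tokenizer(code, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state             # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    # Mean-pool over non-padding tokens; maybe the last token's state works too?
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Is this how people extract embeddings from decoder-only models, or is there a better-established way?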
PS: I've only read about decoder architectures and ended up more confused, so I'm seeking some support here.