Compare two strings by meaning using LLMs

I'd like to use some of the good large language models to estimate how similar the meanings of two strings are, for example "cat" and "someone who likes to play with yarn", or "cat" and "car".

Maybe some libraries provide a function for comparing strings, or we could implement some method such as measuring the similarity of their embeddings in a deep layer or whatever is appropriate.

I hope that something without much boilerplate code is possible. Something like:

import language_models, math
my_llm = language_models.load('llama2')
print(math.dist(
    my_llm.embedding('cat'),
    my_llm.embedding('someone who likes to play with yarn')))

Ideally, it should be easy to try different recent LLMs. (In the "example" above, that would mean replacing 'llama2' by another model name.)

There are 2 answers

4
Nikhil S On

spaCy is the way:

import spacy

# Load the language model
nlp = spacy.load("en_core_web_sm")

# Example sentences
sentence1 = "I love coding "
sentence2 = "I love studying"

# Process the sentences using spaCy
doc1 = nlp(sentence1)
doc2 = nlp(sentence2)

# Compute the similarity between the two sentences
similarity = doc1.similarity(doc2)

print(f"Similarity between the two sentences: {similarity}")

Output:

Similarity between the two sentences: 0.8791585322649781

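The similarity value is a cosine similarity, so in principle it lies between -1 and 1, although for typical text it usually falls between 0 and 1. A value close to 1 means the two sentences are treated as nearly identical in meaning, and a value close to 0 means they have little in common. Note that en_core_web_sm ships without word vectors, so spaCy warns that the score may not be a useful similarity judgement; en_core_web_md or en_core_web_lg include word vectors and generally give more meaningful results.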
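To make the range concrete, here is a minimal pure-Python sketch of cosine similarity, which is what Doc.similarity computes by default over averaged word vectors. The 2-D vectors are made up for illustration and are not real embeddings, and `cosine_similarity` is just a helper name chosen here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

The three cases land at (approximately) the top, middle, and bottom of the range. Only the direction of the vectors matters, not their length.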

0
Seon On

Though it lacks some of the latest models, you could give Hugging Face's sentence-transformers package a try. The models it provides are all specifically trained for embedding comparison.

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.

You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.

The underlying model is easy to switch out, and Hugging Face provides a few thousand options to pick from.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-distilroberta-v1")

query_embedding = model.encode("cat")
passage_embedding = model.encode(["car", "someone who likes to play with yarn"])

# Results aren't great with the all-distilroberta-v1 model
print("Similarity:", util.cos_sim(query_embedding, passage_embedding))
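Since the question asks for an easy way to try different models, one option is to keep the comparison code independent of the model. The following is only a sketch under stated assumptions: `rank_by_similarity`, `load_sentence_transformer` and `toy_encode` are hypothetical helper names, and any callable that maps a list of strings to a list of vectors can be plugged in as the encoder. The toy encoder is there only so that the snippet runs without downloading a model; it counts letters and knows nothing about meaning.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return float(dot / norms)

def rank_by_similarity(query, candidates, encode):
    """Return (candidate, score) pairs, most similar to `query` first.

    `encode` is any callable mapping a list of strings to a list of vectors.
    """
    vectors = encode([query] + list(candidates))
    query_vec, candidate_vecs = vectors[0], vectors[1:]
    scored = [(text, cosine(query_vec, vec))
              for text, vec in zip(candidates, candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def load_sentence_transformer(model_name):
    """Hypothetical helper: expose a sentence-transformers model as an encoder."""
    # Imported lazily so the rest of the sketch runs without the package installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode

def toy_encode(texts):
    """Stand-in encoder (letter counts per string). NOT a semantic model."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[text.count(ch) for ch in alphabet] for text in texts]

ranking = rank_by_similarity(
    "cat", ["car", "someone who likes to play with yarn"], toy_encode)
print(ranking)
```

With the package installed, passing `load_sentence_transformer("all-distilroberta-v1")` instead of `toy_encode` should rank the candidates with a real model, and trying another model only means changing the name string. The toy encoder ranks "car" first because it shares letters with "cat", which is exactly the kind of surface similarity a semantic model is meant to look past.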