OpenAI embeddings: cosine similarity yields implausible values

I am currently experimenting with OpenAI's embeddings API. I would like to calculate the semantic similarity between words or phrases. Here is a simple example:

var embedding1 = ChatGptApi.GetEmbedding("css");
var embedding2 = ChatGptApi.GetEmbedding("cascading style sheets");
var embedding3 = ChatGptApi.GetEmbedding("html");

var similarity12 = Similarity(embedding1, embedding2); // 0.85638954394501088
var similarity13 = Similarity(embedding1, embedding3); // 0.85686847408674782
var similarity23 = Similarity(embedding2, embedding3); // 0.80307937702974364

ChatGptApi.GetEmbedding() calls OpenAI with the model text-embedding-ada-002. Why don't the values reflect the fact that "CSS" and "Cascading Style Sheets" are the same thing, while "HTML" is not? Why aren't the values for the first pair closer to 1, given how close the two phrases are in meaning? Am I misunderstanding something about the concept of OpenAI's embeddings? I have tried other examples as well, and none of them yielded plausible results.

For reference, my cosine similarity function:

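public static double Similarity(double[] vector1, double[] vector2, bool normalized = false)
{
    // Cosine similarity is only defined for vectors of equal dimension.
    if (vector1.Length != vector2.Length)
        throw new ArgumentException("Vectors must have the same length.");

    var product = 0.0;

    // If the inputs are already unit-length, the norms are 1 and need not be computed.
    var norm1 = normalized ? 1.0 : 0.0;
    var norm2 = normalized ? 1.0 : 0.0;

    for (var i = 0; i < vector1.Length; i++)
    {
        // Accumulate the dot product.
        product += vector1[i] * vector2[i];

        if (!normalized)
        {
            // Accumulate the squared Euclidean norms.
            norm1 += vector1[i] * vector1[i];
            norm2 += vector2[i] * vector2[i];
        }
    }

    norm1 = Math.Sqrt(norm1);
    norm2 = Math.Sqrt(norm2);

    return product / (norm1 * norm2);
}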
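
To separate problems in the function from properties of the embeddings, a quick sanity check on hand-made vectors may help. The following is only a minimal sketch: the class name CosineCheck is hypothetical, and the Similarity method inside it repeats the computation above (without the normalized flag) so that the snippet compiles on its own. It does not call the OpenAI API.

```csharp
using System;

// Hypothetical helper class, used only for this sanity check.
public static class CosineCheck
{
    // Same computation as the Similarity function above, minus the "normalized" shortcut.
    public static double Similarity(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    public static void Main()
    {
        // Identical vectors: the result should be very close to 1.
        Console.WriteLine(Similarity(new[] { 1.0, 2.0, 3.0 }, new[] { 1.0, 2.0, 3.0 }));

        // Orthogonal vectors: the dot product is 0, so the result is 0.
        Console.WriteLine(Similarity(new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }));

        // Opposite vectors: the result should be very close to -1.
        Console.WriteLine(Similarity(new[] { 1.0, 2.0 }, new[] { -1.0, -2.0 }));
    }
}
```

If these three cases come out near 1, 0, and -1, the arithmetic in the function is behaving as cosine similarity should, and the question is about the embedding values themselves.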