Is there a way to calculate a distance metric (euclidean or cosine similarity or manhattan) between two homomorphically encrypted vectors?
Specifically, I'm looking to generate embeddings of documents (using a transformer), homomorphically encrypting those embeddings, and wanting to calculate a distance metric between embeddings to obtain document similarity scores.
I have evaluated libraries like concrete-numpy, TenSEAL, and Pyfhel (HE libraries) and each library appears to lack a specific mathematical operation be it division, cumulative sum, or absolute value that prevents generating any of the listed distance metrics above.
(I did find this: https://github.com/ibarrond/Pyfhel/blob/master/examples/Demo_8_HammingDist.py which calculates hamming distance between encrypted vectors, but this metric doesn't help with document similarity).
[Credit goes to ibarrond - answer found here: https://github.com/ibarrond/Pyfhel/issues/155]
There is indeed! You just need to rely on a few tricks to overcome the limitations of supported operations in CKKS/BFV schemes (mainly additions and multiplications):
Cosine Similarity: Formulated as CS(x, y) = (sum(xᵢ * yᵢ))/(||x|| * ||y||), it would require a division and a norm.
The trick: Normalize the vectors x' = x / ||x|| and y' = y / ||y||, encrypt x' and y' and perform a simple scalar product sum(xᵢ' * yᵢ') to obtain the cosine similarity (Check Demo_7_ScalarProd.py on how to do that).
Euclidean Distance: Formulated as ED(x, y) = sqrt( sum(xᵢ² - yᵢ²)), it would require a square root.
The trick: Settle with the Squared Euclidean Distance instead, where SED = sum(xᵢ² - yᵢ²)). In Pyfhel you can perform element-wise squaring natively (Demo_3), followed by Pyfhel.cumul_add for cumulative sum (some examples of this in Demo_7 and Demo_8).
Manhattan Distance: Formulated as MD(x, y) = sum(|xᵢ - yᵢ|), it would require computing the absolute value.
The trick: If you encrypt binary values only (that is, x̂, ŷ s.t. x̂ᵢ , ŷᵢ) ∈ {0,1} ∀ i), you can reformulate MD(x̂, ŷ) = sum((x̂ᵢ - ŷᵢ)²) = HD(x̂, ŷ), the Hamming Distance, which is implemented in Demo_8. For non-binary vectors you can at least compute the Squared Manhattan Distance SMD(x, y) = sum((xᵢ - yᵢ)²), missing a square root to get the MD.
General Tip: Leverage on the operations you can perform before encrypting (e.g., normalization) and after decrypting (e.g., square root of the result) to avoid computing non-linear functions in FHE!
Demos referenced can be found here: https://github.com/ibarrond/Pyfhel/tree/master/examples