Nearest neighbor search over Levenshtein distance in Python using metric indexing

I would like to perform nearest-neighbor search over the Levenshtein distance on a dataset of texts in Python. The search can be approximate, it should be implemented using widely used Python libraries, and each query should take near-constant time with respect to the dataset size. Below, I list some implementations that I considered, together with their flaws.

Naïve brute-force approach

A naïve brute-force solution computes the Levenshtein distance between the query and every text in the dataset:

$ pip install numpy python-Levenshtein
$ python
>>> import Levenshtein
>>> import numpy as np
>>>
>>> texts = ["Some string", "Some other string", "Yet another string"]
>>> query = "Smeo srting"
>>> knn = 2
>>>
>>> distances = [Levenshtein.distance(query, text) for text in texts]
>>> for idx in np.argsort(distances)[:knn]:
...     print(f'{texts[idx]}, {distances[idx]}')
...
Some string, 4
Some other string, 8

However, the naïve approach is quite slow. Computing the Levenshtein distance between two strings of length at most N takes O(N^2) time, so searching for the K nearest neighbors of a query among M texts is an O(M * N^2) operation. Finding the K nearest neighbors for each of the M texts is then an O(M^2 * N^2) operation.
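To make the quadratic blow-up concrete, here is a minimal sketch of the all-pairs variant (the helper name brute_force_knn_all is mine, for illustration only):

import Levenshtein
import numpy as np

def brute_force_knn_all(texts: list, knn: int) -> list:
    # M queries, each scanning all M texts with an O(N^2) distance: O(M^2 * N^2).
    neighbors = []
    for query in texts:
        distances = [Levenshtein.distance(query, text) for text in texts]
        # Drop the first argsort index, which is typically the query itself
        # at distance 0.
        neighbors.append(np.argsort(distances)[1:knn + 1])
    return neighbors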

Metric indexing

One solution that I considered is metric indexing. Metric indexing enables better-than-brute-force K-nearest-neighbor search in metric spaces. The indexed objects do not need to be vectors; they only need a distance function that satisfies the metric axioms (non-negativity, identity of indiscernibles, symmetry, and the triangle inequality), so that they form a metric space. Strings equipped with the Levenshtein distance form such a metric space.
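As a quick spot check (an illustration on a few sample strings, not a proof), the metric axioms can be verified empirically:

import itertools
import Levenshtein

samples = ["Some string", "Some other string", "Yet another string", ""]
d = Levenshtein.distance
for x, y, z in itertools.product(samples, repeat=3):
    assert d(x, y) >= 0                    # non-negativity
    assert (d(x, y) == 0) == (x == y)      # identity of indiscernibles
    assert d(x, y) == d(y, x)              # symmetry
    assert d(x, z) <= d(x, y) + d(y, z)    # triangle inequality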

A nice property of metric indexing is that it is implemented in widely used Python libraries, such as the ball tree index in sklearn.neighbors.BallTree. With the ball tree, searching for the K nearest neighbors of a query becomes an O(log M * N^2) operation. Finding the K nearest neighbors for each of the M texts is then an O(M log M * N^2) operation.

One drawback of the sklearn implementation is that it requires float vectors as input. Therefore, a mapping scheme from strings to float vectors and back seems required. Since float64 represents every integer up to 2^53 exactly, the scheme below can represent datasets of up to 2^53 strings, which seems sufficient:

$ pip install scikit-learn python-Levenshtein
$ python
>>> from typing import Optional
>>> import Levenshtein
>>> import numpy as np
>>> from sklearn.neighbors import BallTree, DistanceMetric
>>> 
>>> texts = ["Some string", "Some other string", "Yet another string"]
>>> query = "Smeo srting"
>>> knn = 2
>>> 
>>> def text_to_vector(text: str, idx: Optional[int] = None) -> np.ndarray:
...     if idx is None:
...         idx = len(texts)
...         texts.append(text)
...     return np.array([idx], dtype=np.float64)
... 
>>> def vector_to_text(idx: np.ndarray) -> str:
...     return texts[int(idx[0])]
... 
>>> def levdist(x: np.ndarray, y: np.ndarray) -> int:
...     x, y = map(vector_to_text, (x, y))
...     return Levenshtein.distance(x, y)
... 
>>> X = [text_to_vector(text, idx) for idx, text in enumerate(texts)]
>>> x = [text_to_vector(query)]
>>> 
>>> tree = BallTree(X, metric=DistanceMetric.get_metric('pyfunc', func=levdist))
>>> (dists,), (idxs,) = tree.query(x, k=knn)
>>> for dist, idx in zip(dists, idxs):
...     print(f'{texts[idx]}, {int(dist)}')
... 
Some string, 4
Some other string, 8

However, the speed still seems a major drawback. I would appreciate pointers to libraries that provide approximate K-nearest-neighbor search over the Levenshtein distance, or over general metric spaces, in near-O(M * N^2) total time for all M texts.
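For illustration, the kind of interface I am hoping for looks roughly like the following sketch with NMSLIB, whose HNSW method can be combined with the built-in integer-valued leven space (untested on my side; the index parameters M and efConstruction are illustrative guesses):

import nmslib

texts = ["Some string", "Some other string", "Yet another string"]
query = "Smeo srting"

# Strings are indexed directly (OBJECT_AS_STRING); Levenshtein distances
# are integers, hence DistType.INT.
index = nmslib.init(method='hnsw', space='leven',
                    data_type=nmslib.DataType.OBJECT_AS_STRING,
                    dtype=nmslib.DistType.INT)
index.addDataPointBatch(texts)
index.createIndex({'M': 16, 'efConstruction': 100})

idxs, dists = index.knnQuery(query, k=2)
for idx, dist in zip(idxs, dists):
    print(f'{texts[idx]}, {dist}')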
