I would like to perform nearest-neighbor search over the Levenshtein distance on a dataset of texts in Python. The search can be approximate, should be implemented using core Python libraries, and each query should take time that is near-constant in the size of the dataset. Below, I list some implementations that I considered together with their flaws.
Naïve brute-force approach
A naïve brute-force solution involves computing the pairwise Levenshtein distances between a query and all texts in a dataset:
$ pip install numpy python-Levenshtein
$ python
>>> import Levenshtein
>>> import numpy as np
>>>
>>> texts = ["Some string", "Some other string", "Yet another string"]
>>> query = "Smeo srting"
>>> knn = 2
>>>
>>> distances = [Levenshtein.distance(query, text) for text in texts]
>>> for idx in np.argsort(distances)[:knn]:
...     print(f'{texts[idx]}, {distances[idx]}')
...
Some string, 4
Some other string, 8
However, the naïve approach is quite slow. Computing the Levenshtein distance between two strings of length at most N is an O(N^2) operation with the standard dynamic program, so searching for the K nearest neighbors of a query among M texts is an O(M * N^2) operation. Finding the K nearest neighbors for each of the M texts is then an O(M^2 * N^2) operation.
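For reference, the O(N^2) cost of a single distance computation comes from the classic Wagner–Fischer dynamic program. Below is a minimal sketch (the function name wagner_fischer is mine; in practice python-Levenshtein computes the same quantity in optimized C):

>>> def wagner_fischer(a: str, b: str) -> int:
...     # row[j] holds the edit distance between the current prefix
...     # of a and b[:j]; only one row is kept in memory.
...     row = list(range(len(b) + 1))
...     for i, ca in enumerate(a, start=1):
...         prev_diag, row[0] = row[0], i
...         for j, cb in enumerate(b, start=1):
...             prev_diag, row[j] = row[j], min(
...                 row[j] + 1,              # deletion
...                 row[j - 1] + 1,          # insertion
...                 prev_diag + (ca != cb),  # substitution or match
...             )
...     return row[-1]
...
>>> wagner_fischer("Smeo srting", "Some string")
4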
Metric indexing
One solution that I considered is metric indexing. Metric indexing enables better-than-naïve speed for K-nearest-neighbor search in metric spaces. The indexed units do not need to be vectors; they just need to be equipped with a distance function, so that they form a metric space. Strings equipped with the Levenshtein distance form such a metric space.
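That the Levenshtein distance satisfies the metric axioms (identity of indiscernibles, symmetry, and the triangle inequality) is a standard result; as a quick empirical spot check rather than a proof, the axioms can be verified on a few sample strings:

>>> import itertools
>>> import Levenshtein
>>>
>>> d = Levenshtein.distance
>>> strings = ["Some string", "Some other string", "Yet another string", ""]
>>> for a, b, c in itertools.product(strings, repeat=3):
...     assert (d(a, b) == 0) == (a == b)    # identity of indiscernibles
...     assert d(a, b) == d(b, a)            # symmetry
...     assert d(a, c) <= d(a, b) + d(b, c)  # triangle inequality
...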
A nice property of metric indexing is that it is implemented in core Python libraries, such as the ball tree index in sklearn.neighbors.BallTree. With the ball tree, searching for the K nearest neighbors of a query becomes an O(log M * N^2) operation. Finding the K nearest neighbors for each of the M texts is then an O(M log M * N^2) operation.
One drawback of the sklearn implementation is that it requires float vectors as input. Therefore, a mapping scheme from strings to float vectors and back seems required. Since float64 represents integers exactly up to 2^53, the solution below can represent datasets of up to 2^53 strings, which seems sufficient:
$ pip install scikit-learn python-Levenshtein
$ python
>>> from typing import Optional
>>> import Levenshtein
>>> import numpy as np
>>> from sklearn.neighbors import BallTree, DistanceMetric
>>>
>>> texts = ["Some string", "Some other string", "Yet another string"]
>>> query = "Smeo srting"
>>> knn = 2
>>>
>>> def text_to_vector(text: str, idx: Optional[int] = None) -> np.ndarray:
...     # Encode a text as a one-element float vector holding its index
...     # into `texts`; unseen texts (such as queries) are appended first.
...     if idx is None:
...         idx = len(texts)
...         texts.append(text)
...     return np.array([idx], dtype=np.float64)
...
>>> def vector_to_text(idx: np.ndarray) -> str:
...     # Decode a one-element float vector back into its text.
...     return texts[int(idx[0])]
...
>>> def levdist(x: np.ndarray, y: np.ndarray) -> int:
...     # Levenshtein distance between the texts that x and y encode.
...     x, y = map(vector_to_text, (x, y))
...     return Levenshtein.distance(x, y)
...
>>> X = [text_to_vector(text, idx) for idx, text in enumerate(texts)]
>>> x = [text_to_vector(query)]
>>>
>>> tree = BallTree(X, metric=DistanceMetric.get_metric('pyfunc', func=levdist))
>>> (dists,), (idxs,) = tree.query(x, k=knn)
>>> for dist, idx in zip(dists, idxs):
...     print(f'{texts[idx]}, {int(dist)}')
...
Some string, 4
Some other string, 8
However, the speed still seems a major drawback. I would appreciate pointers to libraries that provide approximate K-nearest-neighbor search over the Levenshtein distance, or over general metric spaces, in near-O(M * N^2) time for all M texts, i.e. near-O(N^2) per query.