I am trying to use a GPU (A100) to speed up UMAP. My problem is that the Euclidean metric does not work for my data at all, whereas correlation/cosine look promising. However, the code below seems to produce only Euclidean-based results on the GPU, while it works well on the CPU.
Tools:
cuml 23.04.01 cuda11_py310_230421_g958186d07_0 rapidsai
libcuml 23.04.01 cuda11_230421_g958186d07_0 rapidsai
libcumlprims 23.04.00 cuda11_230412_g7502d8e_0 nvidia
python 3.10.11 he550d4f_0_cpython conda-forge
Relevant code:
def umap_cpu(ip_mat, n_components, n_neighbors, metric):
    import umap
    from sklearn.preprocessing import StandardScaler

    # Standardize features, then embed with umap-learn on the CPU.
    scaler = StandardScaler()
    ip_std = scaler.fit_transform(ip_mat)
    reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
    umap_embed = reducer.fit_transform(ip_std)
    return umap_embed

def umap_gpu(ip_mat, n_components, n_neighbors, metric):
    import os

    # Select the GPU before cuml is imported so the setting takes effect.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    from cuml.manifold import UMAP
    from sklearn.preprocessing import StandardScaler

    # Standardize features, then embed with cuML UMAP on the GPU.
    scaler = StandardScaler()
    ip_std = scaler.fit_transform(ip_mat)
    reducer = UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
    umap_embed = reducer.fit_transform(ip_std)
    return umap_embed
Using help, I noticed that other metrics are listed as supported. However, I found an old post in a discussion that said otherwise:
PR will allow the metric for the input KNN graph to be changed but the only supported target metrics currently remain to be categorical and Euclidean. We can support different target metrics (and we have issue open to support them) but they will require a slightly different objective function in the SGD. I do believe there's an error in the throwing of the Python exception (pointed out in this issue)
I would like to know whether other metrics have since been implemented, or whether the help output shows wrong information:
metric : string (default='euclidean'). Distance metric to use. Supported distances are ['l1, 'cityblock', 'taxicab', 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski', 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard'] Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary.
TIA
The metric argument of cuML UMAP specifies the distance metric used to build the KNN graph, and it supports many distance metrics. There is another argument, target_metric, that only supports euclidean and categorical. From your question, it seems that support for more target_metric options is what you are actually looking for. Feel free to show your interest in the addition of this feature on this GitHub issue.
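
To make the role of the KNN-graph metric more concrete, here is a minimal NumPy-only sketch. It is not cuML code: the helpers knn_indices and neighbor_overlap are hypothetical names made up for this illustration, and the data is synthetic. It builds a brute-force KNN graph under two metrics and measures how many neighbours they share, which is the part of the pipeline that the metric argument influences.

```python
import numpy as np


def knn_indices(X, k, metric="euclidean"):
    """Brute-force k nearest neighbours of each row, excluding the row itself.

    Hypothetical helper for illustration; assumes no all-zero rows for
    'cosine' and no constant rows for 'correlation'.
    """
    X = np.asarray(X, dtype=np.float64)
    if metric == "euclidean":
        # Squared Euclidean distances (same neighbour ranking as Euclidean).
        sq = (X ** 2).sum(axis=1)
        D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    elif metric == "cosine":
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        D = 1.0 - Xn @ Xn.T
    elif metric == "correlation":
        # Correlation distance is cosine distance on row-centred data.
        Xc = X - X.mean(axis=1, keepdims=True)
        Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
        D = 1.0 - Xn @ Xn.T
    else:
        raise ValueError(f"unsupported metric: {metric}")
    np.fill_diagonal(D, np.inf)  # a point is never its own neighbour
    return np.argsort(D, axis=1)[:, :k]


def neighbor_overlap(idx_a, idx_b):
    """Mean fraction of neighbours shared between two KNN index arrays."""
    idx_a, idx_b = np.asarray(idx_a), np.asarray(idx_b)
    shared = [len(set(a) & set(b)) for a, b in zip(idx_a, idx_b)]
    return float(np.mean(shared)) / idx_a.shape[1]


# Synthetic data: random directions with very different row magnitudes,
# so Euclidean and cosine neighbourhoods are expected to disagree.
rng = np.random.default_rng(0)
directions = rng.normal(size=(200, 10))
scales = rng.uniform(0.1, 10.0, size=(200, 1))
X = directions * scales

euc = knn_indices(X, 15, "euclidean")
cos = knn_indices(X, 15, "cosine")
overlap_euc_cos = neighbor_overlap(euc, cos)
print("euclidean vs cosine neighbour overlap:", overlap_euc_cos)
```

The same overlap measure could serve as a rough check on your own results: compute the Euclidean KNN of the 2-D embedding returned by the GPU run and compare it with the cosine and Euclidean KNN of the standardized input. If the embedding preserves cosine neighbours no better than Euclidean ones, that would be a hint (not proof) that the metric setting is not having the effect you expect.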