NVIDIA RAPIDS: Non-Euclidean metric in cuML UMAP


I am trying to use a GPU (A100) to speed up UMAP. The Euclidean metric does not work for my data at all, but correlation and cosine look promising. However, the code below seems to produce only Euclidean-based results on the GPU, while the same metrics work as expected on the CPU.

Tools:

cuml                      23.04.01        cuda11_py310_230421_g958186d07_0    rapidsai
libcuml                   23.04.01        cuda11_230421_g958186d07_0          rapidsai
libcumlprims              23.04.00        cuda11_230412_g7502d8e_0            nvidia
python                    3.10.11         he550d4f_0_cpython                  conda-forge

Relevant code:

def umap_cpu(ip_mat, n_components, n_neighbors, metric):
    import umap
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    ip_std = scaler.fit_transform(ip_mat)

    reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
    umap_embed = reducer.fit_transform(ip_std)

    return umap_embed

def umap_gpu(ip_mat, n_components, n_neighbors, metric):
    import os

    # Set these before importing cuml, since CUDA reads them when it initializes.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    from cuml.manifold import UMAP
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    ip_std = scaler.fit_transform(ip_mat)

    reducer = UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
    umap_embed = reducer.fit_transform(ip_std)

    return umap_embed

From the help() output I noticed that other metrics are listed as supported. However, I found an older discussion post that says otherwise:

PR will allow the metric for the input KNN graph to be changed but the only supported target metrics currently remain to be categorical and Euclidean. We can support different target metrics (and we have issue open to support them) but they will require a slightly different objective function in the SGD. I do believe there's an error in the throwing of the Python exception (pointed out in this issue)

I would like to know whether other metrics have since been implemented, or whether the help text is wrong. The docstring says:

metric : string (default='euclidean'). Distance metric to use. Supported distances are ['l1', 'cityblock', 'taxicab', 'manhattan', 'euclidean', 'l2', 'sqeuclidean', 'canberra', 'minkowski', 'chebyshev', 'linf', 'cosine', 'correlation', 'hellinger', 'hamming', 'jaccard']. Metrics that take arguments (such as minkowski) can have arguments passed via the metric_kwds dictionary.

TIA


There is 1 answer

Mickael On

The metric argument of cuml UMAP specifies the distance metric used to build the KNN graph. It supports many distance metrics.
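To illustrate why the choice of KNN metric matters, here is a minimal pure-Python sketch (not cuml code) showing that the same points can have different nearest neighbors under the Euclidean and cosine distances. The helper names and the three toy points are made up for illustration.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_indices(points, k, dist):
    """For each point, return the indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        result.append(others[:k])
    return result


# Hypothetical toy data: point 1 has the same direction as point 0 but is far
# away, while point 2 is close to point 0 but points in another direction.
points = [(1.0, 0.0), (10.0, 0.0), (0.0, 1.5)]

knn_euclidean = knn_indices(points, 1, euclidean)
knn_cosine = knn_indices(points, 1, cosine)

print("Euclidean neighbor of point 0:", knn_euclidean[0])
print("Cosine neighbor of point 0:", knn_cosine[0])
```

Because UMAP builds its embedding from this neighbor graph, a different metric generally yields a different graph and hence a different embedding.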

There is another argument, target_metric, that only supports euclidean and categorical.
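The difference between the two arguments can be sketched as follows. This is a hedged example, not the library's own validation logic: build_umap_kwargs and fit_umap_gpu are hypothetical helpers, the metric list is copied from the docstring quoted in the question (cuml 23.04), and the target_metric values are the two named above. The cuml import is deferred so the checks can run without a GPU.

```python
# KNN-graph metrics, copied from the cuml 23.04 docstring quoted in the question.
KNN_METRICS = {
    "l1", "cityblock", "taxicab", "manhattan", "euclidean", "l2",
    "sqeuclidean", "canberra", "minkowski", "chebyshev", "linf",
    "cosine", "correlation", "hellinger", "hamming", "jaccard",
}

# Target metrics, as stated in this answer.
TARGET_METRICS = {"euclidean", "categorical"}


def build_umap_kwargs(metric="euclidean", target_metric="categorical", **extra):
    """Hypothetical helper: check both arguments and return UMAP keyword args."""
    if metric not in KNN_METRICS:
        raise ValueError(f"unsupported KNN metric: {metric!r}")
    if target_metric not in TARGET_METRICS:
        raise ValueError(f"unsupported target_metric: {target_metric!r}")
    return dict(metric=metric, target_metric=target_metric, **extra)


def fit_umap_gpu(X, y=None, **kwargs):
    """Hypothetical wrapper: fit cuml UMAP; needs a RAPIDS install and a GPU."""
    from cuml.manifold import UMAP  # deferred so the checks above run anywhere

    reducer = UMAP(**build_umap_kwargs(**kwargs))
    return reducer.fit_transform(X, y=y)


# A non-Euclidean metric for the KNN graph passes the check ...
kwargs = build_umap_kwargs(metric="correlation", n_neighbors=15)
print(kwargs)

# ... while a non-Euclidean target_metric is rejected.
try:
    build_umap_kwargs(metric="cosine", target_metric="cosine")
except ValueError as err:
    print(err)
```

On a machine with RAPIDS installed, a call such as fit_umap_gpu(ip_std, metric="correlation", n_components=2) would be the usage under these assumptions.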

From your question, it seems that support for more target_metric options is what you are actually looking for. Feel free to show your interest in this feature on this GitHub issue.