I am seeing a performance problem when running multiple cuml.cluster.KMeans.fit_predict() calls concurrently on a single machine. There is enough memory on both the GPU and the host. When run in isolation, a single call to my function (max_silhouette_score, shown below) takes approximately 1 second. However, when I run two of these calls concurrently, each one takes around 5 seconds, i.e. an overall 5x slowdown.
Here's the context of my usage:
Environment: RTX 3090 GPU, CUDA 11.8, cuML 23.08.00
Dataset: The input is a pandas DataFrame with a shape of 3000x20, consisting entirely of numeric and normalized columns.
Function: I am running my max_silhouette_score() function, which internally calls fit_predict() on the dataset 18 times (once for each value of k from 2 to 19).
Code Snippet:
import numpy as np

from cuml.cluster import KMeans
from cuml.metrics.cluster import silhouette_samples

def max_silhouette_score(df):
    sil_scores = []
    test_range = range(2, 20)
    for k in test_range:
        # Fit KMeans for this k and score the clustering with the mean silhouette
        kmeans = KMeans(n_clusters=k, n_init=10)
        predictions = kmeans.fit_predict(df)
        sil = silhouette_samples(df.values, predictions)
        sil_scores.append(float(np.mean(sil)))
    # Return the best mean silhouette score and the k that produced it
    max_sil_idx = np.argmax(sil_scores)
    max_sil = sil_scores[max_sil_idx]
    max_sil_k = list(test_range)[max_sil_idx]
    return max_sil, max_sil_k
I've confirmed that the machine's resources are not maxed out during the execution. Does anyone have insights into why the concurrent execution is so much slower, or suggestions on how to keep the same per-call performance while running multiple fit_predict() calls concurrently on the same machine?
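For reference, this is roughly how I launch the two calls in parallel. It is a simplified sketch: the synthetic DataFrame and the thread-based driver below are stand-ins for my actual pipeline, not the exact code I run.

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Synthetic stand-in for my real data: 3000 rows x 20 normalized numeric columns
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((3000, 20)), columns=[f"f{i}" for i in range(20)])

# Run two max_silhouette_score() calls at the same time (simplified driver)
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(max_silhouette_score, df.copy()) for _ in range(2)]
    results = [f.result() for f in futures]

print(results)  # two (best_score, best_k) tuples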