Problem with hdbscan used with bertopic: OSError: [Errno 22] Invalid argument


I am writing because I have a problem (silly and obvious introduction, I know).

I am trying to use the BERTopic package using the Python interpreter in RStudio and the reticulate extension:

Python 3.6.13 (C:/Users/Francesco/AppData/Local/r-miniconda/envs/r-reticulate/python.exe)
Reticulate 1.18.9008 REPL -- A Python interpreter in R.

I managed to install it with pip3 install bertopic

At first, trying to install bertopic resulted in an error relating to its hdbscan dependency, specifically to the wheel used. I worked around it by installing hdbscan with conda (with pip the problem appeared unsolvable). After that, both packages seemed to be installed correctly (pip confirmed this).

Afterwards, I tried to follow the package tutorial on Medium/Towards Data Science (I am following the Colab version) to get accustomed to the package and to check that everything was working as it is supposed to.

I am basically copying and pasting the code from Colab into the Python chunks of the RMarkdown file I am using, but when I apply the tutorial's code to the same dataset it uses:

from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)

topics, probs = topic_model.fit_transform(docs)

I get the following error:

Batches: 100%|##########| 589/589 [28:21<00:00, 2.89s/it]
2021-04-29 16:24:25,973 - BERTopic - Transformed documents to Embeddings
2021-04-29 16:24:35,752 - BERTopic - Reduced dimensionality with UMAP
OSError: [Errno 22] Invalid argument
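
The error message above comes without a traceback, so it does not show which call actually fails. A minimal sketch of how the full traceback could be captured is below; the helper name `run_with_traceback` is made up for illustration and is not part of BERTopic or reticulate, and the commented-out usage assumes `topic_model` and `docs` are defined as in the code above.

```python
import traceback


def run_with_traceback(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, None) on success,
    or (None, formatted_traceback) if any exception is raised."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() returns the full traceback of the active exception as a string
        return None, traceback.format_exc()


# Hypothetical usage with the objects from the question:
# result, tb = run_with_traceback(topic_model.fit_transform, docs)
# if tb is not None:
#     print(tb)  # the last frames show which library raised the OSError
```

The returned string can be pasted into the question so that readers can see whether the failing frame sits in hdbscan or somewhere else.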

In theory, going by the output on Colab, the next line should be:

....................... - BERTopic - Clustered UMAP embeddings with HDBSCAN

Since I had problems with hdbscan during installation, I believe the error is somehow related to it. I have read several GitHub and Stack Overflow pages pointing out problems with that package, but I do not know how to solve this one, and I really need to, since I need to use the package for my thesis.
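
As a first sanity check on the installation, a small sketch like the following could confirm which of the relevant packages the reticulate interpreter can actually locate. The helper name `importable` is made up for illustration; `importlib.util.find_spec` is from the standard library, and `umap` is the import name of the umap-learn package. This only checks that the modules can be found, not that they work.

```python
import importlib.util


def importable(names):
    """Map each top-level module name to True if this interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Packages involved in the failing step; a MISSING entry would point to
    # pip and conda having installed into different environments.
    for name, found in importable(["hdbscan", "umap", "bertopic"]).items():
        print(name, "found" if found else "MISSING")
```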

Can someone help me, please?

PS: this is the first time I am asking something on Stack Overflow. I hope I wrote down everything necessary, but if some info is missing, please tell me.

