I am writing because I have a problem (silly and obvious introduction, I know).
I am trying to use the BERTopic package using the Python interpreter in RStudio and the reticulate
extension:
Python 3.6.13 (C:/Users/Francesco/AppData/Local/r-miniconda/envs/r-reticulate/python.exe)
Reticulate 1.18.9008 REPL -- A Python interpreter in R.
I managed to install it with
pip3 install bertopic
At first, trying to install bertopic
resulted in an error realating to its hdbscan
dependence, specifically to the wheel used; I overcame it by installing hdbscan by conda (with pip the problem appeared unsolvable) and after doing it seemed that both were installed and fine (pip would confirm so).
Afterwards, I tried to follow the package tutorial in Medium/Towards Data Science (here the Colab version I’m following) to get accostumed with the package and to check that everything was working as supposed to.
I am basically copying and pasting the code of Colab on the Python chunks in the RMarkdown file I am using, but when I try to apply the same code of the tutorial to the same dataset used:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
from bertopic import BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)
I get the following error:
Batches: 100%|##########| 589/589 [28:21<00:00, 2.89s/it]
2021-04-29 16:24:25,973 - BERTopic - Transformed documents to Embeddings
2021-04-29 16:24:35,752 - BERTopic - Reduced dimensionality with UMAP
OSError: [Errno 22] Invalid argument
In theory, following the output on colab, I should get:
....................... - BERTopic - Clustered UMAP embeddings with HDBSCAN
Since I had problem with hdbscan
I do believe it is somehow related to it, and I read several GitHub and Stackoverflow pages pointing out problems with such a package, but I do not know how to solve this, but I really need to since I need to use package for my thesis.
Can someone help me, please?
PS: it's the first time I am asking stuff on stackoverflow: I hoped I wrote down everything necessary, but if some info is missing, please tell me.