model.transform() throwing error when using cuml for HDBSCAN with calculate_probabilities=True #1317
Comments
Perhaps this if block might be able to use cuML's membership_vector function to align with the CPU hdbscan: BERTopic/bertopic/cluster/_utils.py Lines 47 to 56 in fca5a4f
Or, it could perhaps be updated to reflect that BERTopic/bertopic/cluster/_utils.py Lines 22 to 23 in fca5a4f
|
Ah, it seems indeed that the incorrect function is used there. I believe simply replacing:
from cuml.cluster.hdbscan.prediction import approximate_predict
probabilities = approximate_predict(model, embeddings)
with this should solve the issue:
from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings)
I can fix this in an upcoming release. PRs are also greatly appreciated! |
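To illustrate why the swap matters, here is a minimal sketch. It does not call cuML; `FakeArray`, `fake_approximate_predict`, `fake_membership_vector`, and `describe` are hypothetical stand-ins that only mimic the return shapes: `approximate_predict` returns a `(labels, probabilities)` tuple, whereas `membership_vector` returns a single matrix with one column per cluster.

```python
class FakeArray:
    """Hypothetical stand-in for an array; it only exposes .shape."""

    def __init__(self, rows):
        self.rows = rows
        is_matrix = bool(rows) and isinstance(rows[0], list)
        self.shape = (len(rows), len(rows[0])) if is_matrix else (len(rows),)


def fake_approximate_predict(n_points):
    # Mimics the (labels, probabilities) tuple returned by approximate_predict.
    labels = FakeArray([0] * n_points)
    strengths = FakeArray([1.0] * n_points)
    return labels, strengths


def fake_membership_vector(n_points, n_clusters):
    # Mimics a single (n_points, n_clusters) membership matrix.
    return FakeArray([[1.0 / n_clusters] * n_clusters for _ in range(n_points)])


def describe(result):
    # Stand-in for downstream code that expects one array and reads .shape.
    try:
        return result.shape
    except AttributeError as exc:
        return str(exc)


tuple_result = describe(fake_approximate_predict(4))
matrix_result = describe(fake_membership_vector(4, 3))
print(tuple_result)
print(matrix_result)
```

The first call reproduces the same kind of `AttributeError` reported in this issue, because a tuple has no `shape`; the second returns a shape as the caller expects.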
Thank you @MaartenGr, this change along with another change solved the problem. By just replacing the function from
After looking into the
The final code that works for me is:
from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings, batch_size=min(4096, len(embeddings))) |
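The workaround above caps `batch_size` at the number of points being predicted. A sketch of the same idea as a small helper, assuming cuML is installed and that `membership_vector` accepts a `batch_size` keyword as in the snippet above. The names `clamped_batch_size` and `safe_membership_vector` are hypothetical, and the cap of 4096 simply mirrors that snippet.

```python
def clamped_batch_size(n_points, max_batch_size=4096):
    # Never request a batch larger than the number of points to predict.
    return min(max_batch_size, n_points)


def safe_membership_vector(model, embeddings, max_batch_size=4096):
    # Imported lazily so clamped_batch_size can be used without a GPU install.
    from cuml.cluster.hdbscan.prediction import membership_vector

    batch_size = clamped_batch_size(len(embeddings), max_batch_size)
    return membership_vector(model, embeddings, batch_size=batch_size)


print(clamped_batch_size(120))
print(clamped_batch_size(10000))
```

With a fitted cuML HDBSCAN model, `safe_membership_vector(model, embeddings)` would then stand in for the direct call.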
@slice-pranay Awesome, thanks for diving into this! If you want, it would be great if you create a PR for this. Otherwise, I can also add this in the coming weeks when I find some time. Either way, thanks for this! |
Thanks for surfacing this issue. When used like this, the
import cuml
X, y = cuml.make_blobs(n_samples=100, n_features=3)
clf = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True).fit(X)
cuml.cluster.hdbscan.all_points_membership_vectors(clf)[:5]
array([[1.0000000e+00, 4.6776744e-40, 4.0108805e-40],
[4.9417980e-02, 5.5743980e-01, 7.2683059e-02],
[4.8842371e-02, 7.2603232e-01, 1.0369291e-01],
[7.5122565e-01, 5.8568917e-02, 5.3385083e-02],
[4.5487583e-02, 1.0042124e-01, 5.8100939e-01]], dtype=float32)
I've filed a cuML issue to track this bug. In the meantime, your suggested workaround makes sense! |
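As a reading aid for the output above: each row is one point and each column is one cluster, so the column with the largest value is the most likely cluster for that point. A plain-Python sketch using the five rows printed above (the helper name `most_likely_cluster` is hypothetical):

```python
membership = [
    [1.0000000e+00, 4.6776744e-40, 4.0108805e-40],
    [4.9417980e-02, 5.5743980e-01, 7.2683059e-02],
    [4.8842371e-02, 7.2603232e-01, 1.0369291e-01],
    [7.5122565e-01, 5.8568917e-02, 5.3385083e-02],
    [4.5487583e-02, 1.0042124e-01, 5.8100939e-01],
]


def most_likely_cluster(row):
    # Column index of the largest membership value in one row.
    return max(range(len(row)), key=row.__getitem__)


assignments = [most_likely_cluster(row) for row in membership]
print(assignments)  # [0, 1, 1, 0, 2]
```

Note that the rows do not have to sum to one, so a point with uniformly low values may be better treated as noise than forced into a cluster.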
For completeness, this |
Is this actually fixed in cuML 23.08? I have installed cuML using the instructions at https://docs.rapids.ai/install and |
I'm facing the same issue with cuml 23.10.0 and BERTopic 0.16.0, is there a workaround or fix available? |
As of last week, cuML 24.04 is now available. I think it's probably fair to say that almost everyone using cuML with BERTopic is using a version that supports the If there's interest and bandwidth from the maintainers to provide reviews, I'm happy to open a PR that resolves this issue and the implicitly equivalent #1764 (essentially, an updated version of this PR) cc @MaartenGr |
@beckernick Thanks, that would be great! This has been open for way too long (which is definitely my fault!), so a PR that updates this to the |
Sounds good! |
Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue. The original example works with this PR:
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
train = docs[:15000]
test = docs[15000:]
umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data=True)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(train)
topics_test, probs_test = topic_model.transform(test)
pd.Series(topics_test).value_counts()
2024-04-30 23:29:26,528 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|█████████████████████████████████████████████████████████| 469/469 [00:14<00:00, 31.43it/s]
2024-04-30 23:29:42,841 - BERTopic - Embedding - Completed ✓
2024-04-30 23:29:42,842 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-30 23:29:43,006 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:43,008 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-30 23:29:43,170 - BERTopic - Cluster - Completed ✓
2024-04-30 23:29:43,175 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-30 23:29:46,567 - BERTopic - Representation - Completed ✓
Batches: 100%|█████████████████████████████████████████████████████████| 121/121 [00:03<00:00, 30.64it/s]
2024-04-30 23:29:51,410 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-04-30 23:29:51,431 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:51,432 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-04-30 23:29:51,439 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2024-04-30 23:29:51,446 - BERTopic - Probabilities - Completed ✓
2024-04-30 23:29:51,447 - BERTopic - Cluster - Completed ✓
0 1176
-1 551
1 390
2 362
4 221
3 190
5 157
6 155
7 131
8 122
9 95
10 66
11 57
12 42
13 42
14 40
15 20
17 17
16 12
Name: count, dtype: int64 |
That is all too familiar these days! So thanks for taking the time to create the PR. When it passes, I'll go ahead and merge it in preparation for a minor release. |
Hi Maarten
Firstly, thank you for this amazing library. I'm generating topics on newsgroups data for testing and I am using cuML for UMAP and HDBSCAN. I have set
calculate_probabilities = True
and performed fit_transform() on the data. It worked fine and gave good results. When I try to run transform() on new data, it gives an error: AttributeError: 'tuple' object has no attribute 'shape'. When I set calculate_probabilities = False, this function works fine.
The libraries I am using are
bertopic==0.15.0
cuml-cu11==23.4.1
cudf-cu11==23.4.1
cuda toolkit 11.8
I am running on a virtual Ubuntu machine with a Tesla T4 GPU.
The code to reproduce this error
The error that comes when I run this
Can you please guide me in solving this error?