model.transform() throwing error when using cuml for HDBSCAN with calculate_probabilities=True #1317

slice-pranay · 2023-06-02T12:07:24Z

Hi Maarten

Firstly, thank you for this amazing library. I'm generating topics on the newsgroups data for testing, using cuML for UMAP and HDBSCAN. I set calculate_probabilities=True and ran fit_transform() on the data, which worked fine and gave good results. When I run transform() on new data, however, it raises AttributeError: 'tuple' object has no attribute 'shape'. With calculate_probabilities=False, transform() works fine.

The libraries I am using are:
bertopic==0.15.0
cuml-cu11==23.4.1
cudf-cu11==23.4.1
cuda toolkit 11.8

I am running on a virtual Ubuntu machine with a Tesla T4 GPU.

The code to reproduce this error:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

train = docs[:15000]
test = docs[15000:]

umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data = True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics,probs = topic_model.fit_transform(train)

topics_test, probs_test = topic_model.transform(test)

Running this raises the AttributeError described above.

Can you please guide me in solving this error?


beckernick · 2023-06-02T19:58:39Z

Perhaps this if block might be able to use cuML's membership_vector function to align with the CPU hdbscan:

BERTopic/bertopic/cluster/_utils.py

Lines 47 to 56 in fca5a4f

    
    if func == "membership_vector":
        if isinstance(model, hdbscan.HDBSCAN):
            probabilities = hdbscan.membership_vector(model, embeddings)
            return probabilities

        str_type_model = str(type(model)).lower()
        if "cuml" in str_type_model and "hdbscan" in str_type_model:
            from cuml.cluster.hdbscan.prediction import approximate_predict
            probabilities = approximate_predict(model, embeddings)
            return probabilities

Or, it could perhaps be updated to reflect that approximate_predict returns a tuple of (labels, probabilities) (even if only the probabilities will be returned by the function).

BERTopic/bertopic/cluster/_utils.py

Lines 22 to 23 in fca5a4f

    
    predictions, probabilities = hdbscan.approximate_predict(model, embeddings)
    return predictions, probabilities
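
To make the second option concrete, here is a minimal sketch of unpacking the (labels, probabilities) tuple instead of returning it whole. Everything in it is a stand-in written for illustration: fake_approximate_predict, fake_membership_vector and probabilities_from are hypothetical names, not BERTopic or cuML functions, and the returned values are dummies that only mimic the return types discussed above.

```python
# Hypothetical stand-ins: they only mimic the *return types* discussed above,
# not the real cuML computations.
def fake_approximate_predict(model, embeddings):
    """Returns a (labels, probabilities) tuple, one entry per point."""
    labels = [0 for _ in embeddings]
    probabilities = [1.0 for _ in embeddings]
    return labels, probabilities


def fake_membership_vector(model, embeddings, n_clusters=3):
    """Returns one row of per-cluster membership values per point."""
    return [[1.0 / n_clusters] * n_clusters for _ in embeddings]


def probabilities_from(func_name, model, embeddings):
    """Always hand back probabilities only, never the raw tuple."""
    if func_name == "approximate_predict":
        # Unpack the tuple; returning it whole is what makes a later
        # `.shape` access fail with "'tuple' object has no attribute 'shape'".
        _labels, probabilities = fake_approximate_predict(model, embeddings)
        return probabilities
    if func_name == "membership_vector":
        return fake_membership_vector(model, embeddings)
    raise ValueError(f"unknown function: {func_name}")


points = [[0.1, 0.2], [0.3, 0.4]]
raw = fake_approximate_predict(None, points)
print(type(raw).__name__, hasattr(raw, "shape"))
print(probabilities_from("membership_vector", None, points))
```

Note that the two branches return different things (one value per point versus one row per point), which is why switching to a membership_vector-style call is the closer match to the CPU hdbscan path.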

MaartenGr · 2023-06-03T04:48:46Z

Ah, it seems indeed that the incorrect function is used there. I believe simply replacing:

from cuml.cluster.hdbscan.prediction import approximate_predict 
probabilities = approximate_predict(model, embeddings)

with this should solve the issue:

from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings)

I can fix this in an upcoming release. PRs are also greatly appreciated!

slice-pranay · 2023-06-05T06:31:33Z

Thank you @MaartenGr, this change along with another change solved the problem. Just replacing approximate_predict with membership_vector gave another error:

ValueError: batch_size should be in integer that is >= 0 and <= the number of prediction points

Looking at the membership_vector function in the cuml.cluster.hdbscan.prediction.pyx file, there is another parameter, batch_size, which defaults to 4096. The function is missing a check that reduces this value to the number of embeddings when there are fewer than 4096 of them. Adding that check in the function call itself solved the issue.

The final code that works for me is

from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings, batch_size=min(4096, len(embeddings)))
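
The clamp in that call can be pulled out into a small helper. This is only a sketch: clamp_batch_size and DEFAULT_BATCH_SIZE are names made up for illustration, and the 4096 default is the value reported above rather than something read from cuML at runtime.

```python
# Default reported in this thread for cuML's membership_vector (assumption:
# hard-coded here for illustration, not read from the library).
DEFAULT_BATCH_SIZE = 4096


def clamp_batch_size(n_points, batch_size=DEFAULT_BATCH_SIZE):
    """Return a batch size that never exceeds the number of prediction points."""
    if n_points < 0 or batch_size < 0:
        raise ValueError("n_points and batch_size must be non-negative")
    return min(batch_size, n_points)


# Fewer points than the default: the batch size shrinks to the point count.
print(clamp_batch_size(100))
# More points than the default: the default is kept.
print(clamp_batch_size(10_000))
```

With a helper like this, the call would read membership_vector(model, embeddings, batch_size=clamp_batch_size(len(embeddings))).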

MaartenGr · 2023-06-05T07:24:16Z

@slice-pranay Awesome, thanks for diving into this! If you want, it would be great if you create a PR for this. Otherwise, I can also add this in the coming weeks when I find some time. Either way, thanks for this!

beckernick · 2023-06-05T13:29:15Z

> Thank you @MaartenGr, this change along with another change solved the problem. Just replacing approximate_predict with membership_vector gave another error:
> ValueError: batch_size should be in integer that is >= 0 and <= the number of prediction points
> Looking at the membership_vector function in the cuml.cluster.hdbscan.prediction.pyx file, there is another parameter, batch_size, which defaults to 4096. The function is missing a check that reduces this value to the number of embeddings when there are fewer than 4096 of them. Adding that check in the function call itself solved the issue.
>
> The final code that works for me is
> from cuml.cluster.hdbscan.prediction import membership_vector
> probabilities = membership_vector(model, embeddings, batch_size=min(4096, len(embeddings)))

Thanks for surfacing this issue. When used like this, the batch_size parameter shouldn't be necessary (and shouldn't have any effect). The parameter is designed for the scenario where there is a large amount of data and users may want to trade a little performance against peak memory requirements (though the default batch size of 4096 is likely the right choice, as it significantly reduces peak memory requirements with a very minor impact on performance). membership_vector should be adjusting the batch size under the hood, as all_points_membership_vectors already does.

import cuml

X, y = cuml.make_blobs(n_samples=100, n_features=3)

clf = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True).fit(X)
cuml.cluster.hdbscan.all_points_membership_vectors(clf)[:5]
array([[1.0000000e+00, 4.6776744e-40, 4.0108805e-40],
       [4.9417980e-02, 5.5743980e-01, 7.2683059e-02],
       [4.8842371e-02, 7.2603232e-01, 1.0369291e-01],
       [7.5122565e-01, 5.8568917e-02, 5.3385083e-02],
       [4.5487583e-02, 1.0042124e-01, 5.8100939e-01]], dtype=float32)

I've filed a cuML issue to track this bug. In the meantime, your suggested workaround makes sense!
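
To illustrate the batching point in general terms, here is a toy sketch in plain Python. It is not cuML's implementation: normalise_row and process_in_batches are hypothetical names, and the per-row computation is a placeholder. It only shows the pattern of clamping the batch size internally and that, for a row-independent computation, the batch size changes how much is held at once but not the result.

```python
def normalise_row(row):
    """Toy per-row computation: scale a row so its values sum to 1."""
    total = sum(row)
    return [value / total for value in row]


def process_in_batches(rows, batch_size=4096):
    """Process rows chunk by chunk, clamping batch_size to the input size."""
    if not rows:
        return []
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Clamp internally so callers never have to pass min(batch_size, len(rows)).
    batch_size = min(batch_size, len(rows))
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]  # only this chunk is worked on
        results.extend(normalise_row(row) for row in batch)
    return results


rows = [[1.0, 1.0, 2.0], [3.0, 1.0, 0.0], [2.0, 2.0, 4.0]]
# Same output whether rows are handled one at a time or all at once.
print(process_in_batches(rows, batch_size=1) == process_in_batches(rows))
```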

beckernick · 2023-06-06T20:40:23Z

For completeness, this membership_vector bug has now been fixed in cuML. The fix won't be in the 23.06 stable release that is about to happen, but it is now available in the 23.08 nightly packages.

HeadCase · 2023-09-08T14:07:47Z

Is this actually fixed in cuML 23.08? I have installed cuML using the instructions at https://docs.rapids.ai/install and from cuml import __version__ reports 23.08.00. Running the original poster's code example exactly as-is still produces the AttributeError: 'tuple' object has no attribute 'shape'. Is there something I am missing here?

nilsblessing · 2023-12-07T18:18:14Z

I'm facing the same issue with cuml 23.10.0 and BERTopic 0.16.0, is there a workaround or fix available?

beckernick · 2024-04-15T20:02:04Z

As of last week, cuML 24.04 is now available. I think it's probably fair to say that almost everyone using cuML with BERTopic is using a version that supports the membership_vector function.

If there's interest and bandwidth from the maintainers to provide reviews, I'm happy to open a PR that resolves this issue and the implicitly equivalent #1764 (essentially, an updated version of this PR)

cc @MaartenGr

MaartenGr · 2024-04-18T14:03:57Z

@beckernick Thanks, that would be great! This has been open for way too long (which is definitely my fault!), so a PR that updates this to use membership_vector sounds good. I also intend to release a minor version of BERTopic soon with many fixes, so this would be good timing to have it included.

beckernick · 2024-04-18T18:07:23Z

Sounds good!

beckernick · 2024-05-01T03:30:15Z

Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.

The original example works with this PR:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

train = docs[:15000]
test = docs[15000:]

umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data = True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics,probs = topic_model.fit_transform(train)

topics_test, probs_test = topic_model.transform(test)
pd.Series(topics_test).value_counts()

2024-04-30 23:29:26,528 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|█████████████████████████████████████████████████████████| 469/469 [00:14<00:00, 31.43it/s]
2024-04-30 23:29:42,841 - BERTopic - Embedding - Completed ✓
2024-04-30 23:29:42,842 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-30 23:29:43,006 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:43,008 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-30 23:29:43,170 - BERTopic - Cluster - Completed ✓
2024-04-30 23:29:43,175 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-30 23:29:46,567 - BERTopic - Representation - Completed ✓
Batches: 100%|█████████████████████████████████████████████████████████| 121/121 [00:03<00:00, 30.64it/s]
2024-04-30 23:29:51,410 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-04-30 23:29:51,431 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:51,432 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-04-30 23:29:51,439 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2024-04-30 23:29:51,446 - BERTopic - Probabilities - Completed ✓
2024-04-30 23:29:51,447 - BERTopic - Cluster - Completed ✓
 0     1176
-1      551
 1      390
 2      362
 4      221
 3      190
 5      157
 6      155
 7      131
 8      122
 9       95
 10      66
 11      57
 12      42
 13      42
 14      40
 15      20
 17      17
 16      12
Name: count, dtype: int64

MaartenGr · 2024-05-07T14:04:57Z

@beckernick

> Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.

That is all too familiar these days! So thanks for taking the time to create the PR. When it passes, I'll go ahead and merge it in preparation for a minor release.

stevetracvc mentioned this issue Jun 8, 2023: add support for cuml hdbscan membership_vector #1324 (Open)

MaartenGr mentioned this issue Jan 31, 2024: BERTopic Loading Issue #1764 (Closed)

beckernick mentioned this issue May 1, 2024: Fix transform when using cuML HDBSCAN #1960 (Merged)

MaartenGr closed this as completed in #1960 May 7, 2024
