add support for cuml hdbscan membership_vector #1324

Open · wants to merge 6 commits into base: master from cuml-hdbscan-membership_vector
Conversation

stevetracvc
Contributor

cuML version 23.04 now has membership_vector, so this allows topic_model.transform() to calculate the probability matrix when using a cuML-based HDBSCAN model.

As I was preparing to submit this PR, I saw #1317 and that some of you have already figured this out 😄

I included batch_size as a new parameter to hdbscan_delegator just in case someone needs to pass in a batch size somehow.
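The delegation this PR describes might look roughly like the sketch below. This is hypothetical: the real code lives in BERTopic's hdbscan_delegator and would call cuml.cluster.hdbscan.membership_vector; here the model is duck-typed so the example is self-contained and runnable without a GPU.

```python
# Hypothetical sketch of the proposed delegation. In BERTopic the call
# would go through cuML's membership_vector; here `model` is any object
# exposing a compatible method, so the example runs without cuML.
def calculate_probabilities(model, embeddings, batch_size=4096):
    """Return a soft-membership (probability) matrix, or None when the
    backend lacks membership_vector (e.g. cuML < 23.04)."""
    try:
        # batch_size is the new passthrough parameter proposed in this PR
        return model.membership_vector(embeddings, batch_size=batch_size)
    except AttributeError:
        # Older backend: soft-clustering probabilities are unavailable
        return None
```

Catching AttributeError (rather than a broad except) is what lets newer and older cuML versions coexist behind one call site.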

@MaartenGr
Owner

Thanks for the PR! Did you check whether the try/except clause works? According to the issue you refer to, approximate_predict returns a tuple, so I think it will break if you are using cuML < 23.04. Also, I am wondering whether the batch_size parameter should be added at all, since the underlying issue will be fixed in cuML 23.08.

@stevetracvc stevetracvc force-pushed the cuml-hdbscan-membership_vector branch from 6f2f48e to 2fce29b Compare June 9, 2023 19:04
@stevetracvc
Copy link
Contributor Author

Good catch; that should be an AttributeError instead.

I'm fine with removing the batch_size param, but I added it so you can use a custom batch size if necessary. I think it's important to leave the check in there, because telling people this function won't work with cuML 23.04 or 23.06 isn't great.

I just left it as it was, and yes, approximate_predict returns a tuple. What should the expected return be if someone tries to get membership_vector while using an incompatible cuML version?
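To illustrate the distinction being discussed: membership_vector (cuML >= 23.04) yields a full per-topic membership matrix, whereas approximate_predict returns a (labels, probabilities) tuple, so a fallback can only salvage per-point probabilities. The sketch below is one possible shape for such a fallback, not the PR's actual code; the module object is duck-typed so the example is runnable.

```python
def soft_probabilities(hdbscan_module, model, embeddings):
    """One possible fallback scheme (hypothetical, not the PR's code)."""
    try:
        # Available from cuML 23.04 onward: full membership matrix
        return hdbscan_module.membership_vector(model, embeddings)
    except AttributeError:
        # Older cuML: approximate_predict returns (labels, probabilities),
        # so only the flat per-point probabilities can be recovered
        _labels, probabilities = hdbscan_module.approximate_predict(model, embeddings)
        return probabilities
```

The key point for the review comment above: the two code paths return differently shaped results, so callers need to know which one they got.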

@MaartenGr
Owner

Apologies for the late reply. I am wondering whether it would be better to skip handling the batch_size parameter altogether, since that was fixed in the most recent release. Although I understand that previous versions are still in use, I believe most users get the newest stable release automatically when installing cuML. I'm afraid it will cause problems later on if details of the batch_size parameter change or it is removed.

@HeadCase

HeadCase commented Aug 1, 2023

I am currently experiencing CUDA out-of-memory (OOM) errors on my (admittedly large) dataset when hdbscan_delegator makes the call to all_points_membership_vectors. I know I can keep tweaking min_cluster_size/min_samples to reduce memory pressure (or switch off calculate_probabilities), but I would prefer those to be last resorts. My understanding from the RAPIDS docs is that manually reducing the batch_size in this call has a chance of fixing my OOM issues. Could this batch size be made user-configurable?

@stevetracvc
Contributor Author

@HeadCase, what version of cuML are you using? I think version 23.08 uses a default batch size of 4096, which will likely fix your problem. You could also take a look at my first commit, which does include a batch_size parameter, in case 4096 is still too big for you.
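For intuition on why batch_size helps with OOM: the membership computation can be done one fixed-size chunk at a time, so only a chunk's worth of intermediate memory is live at once. A generic, cuML-free sketch of that pattern (the function and parameter names here are illustrative, not cuML's API):

```python
def batched_apply(points, compute_chunk, batch_size=4096):
    """Apply compute_chunk to fixed-size slices and concatenate results.

    A smaller batch_size lowers peak memory at some cost in speed, which
    is the trade-off behind cuML's batch_size parameter.
    """
    results = []
    for start in range(0, len(points), batch_size):
        results.extend(compute_chunk(points[start:start + batch_size]))
    return results
```

With a pattern like this, dropping batch_size from 4096 to, say, 1024 quarters the per-chunk working set without changing the final result.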

@stevetracvc
Contributor Author

stevetracvc commented Aug 3, 2023

@MaartenGr, sorry, this dropped off my radar. I understand the concern about future problems, but rather than removing it, I can add version checks. I think it's 23.04 and 23.06 that need the batch_size patch,

e.g.:
if cuml.__version__.startswith("23.04") or cuml.__version__.startswith("23.06"):
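An equivalent check via parsed version components might be slightly less brittle than string-prefix matching (a sketch; the function name is hypothetical, and the stdlib-only parsing avoids adding a dependency):

```python
def needs_batch_size_patch(cuml_version):
    """Hypothetical helper: True only for the cuML releases that need
    the explicit batch_size workaround (23.04 and 23.06); 23.08 fixed
    the default, so newer releases pass through untouched."""
    major, minor = (int(part) for part in cuml_version.split(".")[:2])
    return (major, minor) in {(23, 4), (23, 6)}
```

In BERTopic this would be fed cuml.__version__, gating the workaround to exactly the affected releases.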

@MaartenGr
Owner

Seeing as the newest version of cuML resolves the issue, would it not be more straightforward to just mention that users need the latest version? Especially since it was fixed in their most recent release.

@stevetracvc
Contributor Author

For me, it's a lot easier to update the BERTopic version in my environment than to update cuML. I've had major dependency conflicts when trying to install cuML alongside some of my other packages.

My point being, other people might not be able to update cuML in their environments just to get a feature that technically does work. But I get that you don't want the code cluttered.

Your call. We could just leave the commit as is and not merge it, and people can patch their copy of BERTopic if needed.

@MaartenGr
Owner

Hmmm, in that case it might indeed be better to check for those specific versions and only apply the fix there. I think that would be more stable and would still allow users to keep using their current cuML version. As long as it does not affect newer versions, that should suffice.
