Add nr_repr_docs parameter #2130

robechun · 2024-08-28T00:23:25Z

What does this PR do?

Adds a parameter that you can pass around for getting the nr_repr_docs. This helps when attempting to get a higher # of representative docs for a given topic.

Before submitting

This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
Did you read the contributor guideline?
Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes (if applicable)?
Did you write any new necessary tests?

MaartenGr · 2024-08-29T06:03:58Z

Thank you for this pull request! I believe there were a number of issues open discussing this, which issue does this PR refer to? I remember there was some discussion on how to approach this there. That perspective will help me understand what was needed and what decisions we made before going through this PR.

robechun · 2024-08-29T07:11:11Z

I didn't actually look at past issues—it was more from my own usage that I found this added parameter to be helpful. That said, it looks like this issue is the most pertinent

MaartenGr · 2024-08-29T12:34:58Z

Ah, then you might have missed the contributing guideline! I think it's first important to establish what exactly is needed.

From the issue you shared as well as the connected one, it seems that users also want to re-calculate the number of representative documents after training the model.

Also, what made you choose to add this parameter to .transform rather than the init? This seems like a parameter that should go there.

robechun · 2024-08-29T22:50:51Z

Whoops, missed that part on creating an issue! 😅

WRT what is exactly needed—AFAICT, I think there are three needs/wants based on the specific usage pattern.

Where users simply use fit_transform to get the topic clusters (in my use-case)
Where users want to get representative docs for online topic modeling
Where users want to specify nr_repr_docs when they call merge_topics

This PR would solve for 1 and 3 (given the nr_repr_docs can now be added as an input parameter to either fit_transform and merge_topics) but not necessarily 2.

As for why I included it in transform vs init—my mental model was the following:

init is used for instantiating the different "lego" pieces for the BERTopic orchestration (such as clustering algorithm, representation model, etc).
fit_transform or merge_models is where the user can provide specifics about what they want from the model—for example, the model itself remains the same, but I could run the same model with different documents to produce different outputs. Similarly, I could run the same model with the same documents and different nr_repr_docs to extract a slightly different output (namely the set of representative docs).

Let me know if you think I'm off in any way

MaartenGr · 2024-09-01T07:33:58Z

This PR would solve for 1 and 3 (given the nr_repr_docs can now be added as an input parameter to either fit_transform and merge_topics) but not necessarily 2.

I think it would indeed be nice to give users the ability to choose the number of representative documents. Wouldn't it be easier to update the number of representative documents after fitting the model? That way, it would solve all 3 problems without the need to actually needing to refit the model constantly. The representative documents should not affect the representations (except for LLM-based representation models which has separate parameters for this) so it does not feel like it should be a parameter related to either init or transform. It might be easiest and more related to the task itself to put it after training the model.

As for why I included it in transform vs init—my mental model was the following:

Generally, we would put these kinds of parameter in the init since it is part of the parameter space of the model and is not specific to the behavior of transform. However, as mentioned above, it might actually be better to go with a new function that allows you to update the number of representative documents after having created the model.

Add nr_repr_docs parameter

28c159b

Merge branch 'MaartenGr:master' into master

242001f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add nr_repr_docs parameter #2130

Add nr_repr_docs parameter #2130

robechun commented Aug 28, 2024

MaartenGr commented Aug 29, 2024

robechun commented Aug 29, 2024

MaartenGr commented Aug 29, 2024

robechun commented Aug 29, 2024

MaartenGr commented Sep 1, 2024

Add nr_repr_docs parameter #2130

Are you sure you want to change the base?

Add nr_repr_docs parameter #2130

Conversation

robechun commented Aug 28, 2024

What does this PR do?

Before submitting

MaartenGr commented Aug 29, 2024

robechun commented Aug 29, 2024

MaartenGr commented Aug 29, 2024

robechun commented Aug 29, 2024

MaartenGr commented Sep 1, 2024