Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token ID computation in BM25/BM42: Possible id overlap due to absolute value casting of hash #369

Open
freinold opened this issue Oct 17, 2024 · 1 comment

Comments

@freinold
Copy link

Found thus only cause I was curious how fastembed calculates token ids for bm25/bm42, so I took a deep dive:

Problem

The function compute_token_id computes the absolute value of a signed 32 bit integer returned by the mmh3 hash lib.

    @classmethod
    def compute_token_id(cls, token: str) -> int:
        return abs(mmh3.hash(token))

This could lead to token id overlap, e.g. if hash of "black" is -42 and "white" is 42 (this is exaggerated of course), the token id would be 42 for both of them.

Is this expexted behaviour?

Proposed Solution

The hash call could also provide an unsigned 32 bit int (casted to a positive 64bit int, which would be no problem for qdrant since its 64 bit native).
This could be achieved by specifying the following extra arguments:

    @classmethod
    def compute_token_id(cls, token: str) -> int:
        return mmh3.hash(token, seed=0, signed=False)

I'm happy to provide a PR if you would like this behaviour to be changed.

@joein
Copy link
Member

joein commented Oct 18, 2024

Hi @freinold ,

Thank you for highlighting this issue!
We've decided to take a bit more sophisticated approach than the suggested, and it requires changes in the core.
I'll post here a link to the issue/pr in the core once it is available

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants