
clarification on method #5

Open · conway-abacus opened this issue Feb 6, 2024 · 0 comments

Hi @swj0419 and other authors, thanks for making this code available and easy to run so we can explore contamination in various open-source models.

Given that this repo/approach has gained some adoption in the community for reporting contamination scores on benchmark datasets, I would like to clarify a few things about how these scores are calculated:

  • the value computed/returned by the script does not seem to be the Min-K% Prob formula given in the paper (see the sketch after this list). Is there a reason for this?
  • the README says: "If #the result < 0.1# with a percentage greater than 0.85, it is highly likely that the dataset has been trained." How was this threshold determined?
  • for the Min-K% Prob number itself, assuming we were to compute it, is there similar guidance on thresholds above which we can have high confidence that the dataset has been trained on?
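
For reference, my reading of the paper's formula: Min-K% Prob(x) is the average log-likelihood of the k% of tokens in x with the lowest log p(x_i | x_1, ..., x_{i-1}). Below is a minimal sketch of that computation, assuming a HuggingFace causal LM; the function name, example model, and default k=0.2 are my own illustrative choices, not code from this repo:

```python
# Minimal sketch of Min-K% Prob as I understand it from the paper:
# the average log-likelihood of the k% lowest-probability tokens.
# Illustrative only -- not the script in this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_prob(text, model, tokenizer, k=0.2):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)
    # Log-probability each actual token received, given its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, input_ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    # Average over the k% of tokens with the lowest log-probability.
    n_keep = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n_keep, largest=False).values
    return lowest.mean().item()

# Example usage (hypothetical model choice):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# score = min_k_percent_prob("some benchmark example ...", lm, tok, k=0.2)
```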

I could not find these specific details in the original paper - apologies if I missed them. Thanks in advance!
