How to handle duplicate document IDs? #50

Open
mrdrozdov opened this issue May 24, 2024 · 1 comment

Comments

@mrdrozdov

What if I am predicting a ranked list in which the same document ID appears multiple times at different positions? How can I evaluate nDCG for this using pytrec_eval, given that scores are represented as dictionaries?

@seanmacavaney
Contributor

Hey @mrdrozdov -- trec_eval itself checks for duplicate documents and raises an error if it finds any, so I'm not sure it would make sense for the Python wrapper to diverge from this behavior.
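One possible workaround, sketched here as an assumption and not something proposed in the thread, is to collapse duplicates before evaluation by keeping only the first (highest-ranked) occurrence of each document ID. The helper name `ranked_list_to_run` and the rank-based scores are hypothetical. The scores exist only to preserve the list order, since a run is ranked by score and not by dictionary order.

```python
def ranked_list_to_run(ranked_doc_ids):
    """Drop repeated document IDs (keeping the first occurrence) and
    assign strictly decreasing scores so the original order is preserved."""
    seen = set()
    unique = []
    for doc_id in ranked_doc_ids:
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(doc_id)
    n = len(unique)
    # The top-ranked document gets the highest score; all scores are distinct.
    return {doc_id: float(n - rank) for rank, doc_id in enumerate(unique)}


# Hypothetical ranked list for a single query, with "d3" and "d1" repeated.
ranked = ["d3", "d1", "d3", "d2", "d1"]
run = {"q1": ranked_list_to_run(ranked)}
print(run)
```

The resulting `run` dictionary has the query-to-document-to-score shape that `pytrec_eval.RelevanceEvaluator(qrel, {'ndcg'}).evaluate(run)` expects. Whether dropping the later occurrences is appropriate depends on what the duplicates mean in your task.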

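Even so, many measures are not well-defined in the presence of duplicate documents. E.g., you could get an nDCG score > 1 when duplicates are present, because a repeated relevant document is credited at every position where it appears while the ideal ranking counts it only once. So you'd have to think carefully about which measures are potentially suitable.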
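To illustrate how a score above 1 can arise, here is a minimal sketch of a naive nDCG computation that does not check for duplicates. It assumes linear gain and a log2 rank discount. The function names and the toy relevance judgments are hypothetical, and this is not pytrec_eval's code.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), rank starting at 1."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def naive_ndcg(ranked_doc_ids, qrels):
    """nDCG with no duplicate check: a repeated document is credited every time it appears."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking lists each judged document exactly once, best first.
    ideal_gains = sorted(qrels.values(), reverse=True)
    idcg = dcg(ideal_gains)
    return dcg(gains) / idcg if idcg > 0 else 0.0


qrels = {"d1": 1, "d2": 1}  # two relevant documents for one query

print(naive_ndcg(["d1", "d2"], qrels))        # perfect ranking without duplicates
print(naive_ndcg(["d1", "d1", "d2"], qrels))  # "d1" repeated: the value exceeds 1
```

With the duplicate, the ranked list accumulates three relevant gains while the ideal ranking can only contain two, so the ratio is no longer bounded by 1.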
