[BUG] Fuzzy deduplication fails on datasets with no duplicates #67
Comments
Assigning it to @ayushdg.
@ayushdg I heard from @praateekmahajan that he's not seeing this issue on his end. I just pulled the latest main branch to confirm, but I still hit the same error as before; it looks like some intermediate file is not being found.
There might have been some confusion. @praateekmahajan is looking into a different issue, one that cannot be reproduced. This particular issue can be reproduced and still persists. Thanks for confirming, @Maghoumi!
@Maghoumi +1, sorry, my bad on the confusion.
Describe the bug
Running fuzzy deduplication on a dataset that contains no duplicates causes the code to error out.
Steps/Code to reproduce bug
Run examples/fuzzy_deduplication.py on the dataset under tutorials/tinystories/data/jsonl/val/ (the TinyStories validation split, which contains no fuzzy duplicates); see the sketch below.
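For context, a minimal sketch of what this reproduction boils down to. It is not copied from examples/fuzzy_deduplication.py; the class, method, and parameter names follow my recollection of the NeMo Curator API and should be treated as assumptions.

```python
# Hypothetical reproduction sketch; the names used here (FuzzyDuplicates,
# FuzzyDuplicatesConfig, DocumentDataset.read_json, cache_dir/id_field/text_field)
# are assumptions about the NeMo Curator API and may differ from the example script.
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

# TinyStories validation split: small and contains no fuzzy duplicates.
dataset = DocumentDataset.read_json(
    "tutorials/tinystories/data/jsonl/val/", backend="cudf"
)

config = FuzzyDuplicatesConfig(
    cache_dir="./fuzzy_dedup_cache",
    id_field="id",
    text_field="text",
)
fuzzy_dup = FuzzyDuplicates(config=config)

# Because no duplicate pairs are ever generated, a later stage of the pipeline
# apparently looks for an intermediate file that was never written and crashes.
duplicates = fuzzy_dup(dataset)
```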
Expected behavior
The code should not crash. When no duplicates are found, the pipeline should finish cleanly, for example by returning an empty result or the input dataset unchanged; a sketch of that kind of guard is below.
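As an illustration only, a minimal self-contained sketch of the expected handling, written in plain pandas with hypothetical function and column names rather than NeMo Curator's actual code: if the duplicate-finding stage produces an empty or missing result, downstream removal is skipped instead of assuming output that was never produced.

```python
# Hypothetical guard sketch in plain pandas; not NeMo Curator's implementation.
from typing import Optional

import pandas as pd


def remove_fuzzy_duplicates(
    docs: pd.DataFrame, duplicates: Optional[pd.DataFrame]
) -> pd.DataFrame:
    """Drop duplicate documents, tolerating a dedup stage that found nothing."""
    # Guard: with no duplicate groups there is nothing to drop, so return the
    # input unchanged instead of failing on a missing/empty intermediate result.
    if duplicates is None or duplicates.empty:
        return docs
    # Keep the first document of each duplicate group, drop the rest.
    extras = duplicates.groupby("group")["id"].apply(lambda ids: ids.iloc[1:])
    return docs[~docs["id"].isin(set(extras))]


# Dataset with no duplicates: the guard returns the input as-is.
docs = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
no_dups = pd.DataFrame(columns=["id", "group"])
assert remove_fuzzy_duplicates(docs, no_dups).equals(docs)
```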
Environment details
Using the NVIDIA pytorch:24.04-py3 container image.