[CodeParrot] Near-deduplication with jaccard similarity #17054
Conversation
Hi @liyongsea

Thanks for adding deduplication, that's an exciting addition! In terms of structure, I think it would be good to include this in the main `preprocessing.py`. Since it requires quite a bit of code, we could probably leave most of it in `minhash_deduplication.py` and then do something like the following in `preprocessing.py`:

```python
from minhash_deduplication import deduplicate_dataset

# other preprocessing steps
ds = deduplicate_dataset(ds, arg1, arg2, ...)
# save dataset and push to hub
```

I like that you'll use `dataset.map()` for the parallelization - it matches well with the rest of the codebase. You can probably also do `minhash_iter` with a simple `map`.
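For illustration, here is a minimal sketch of what computing a per-example MinHash inside `dataset.map` could look like. The `compute_minhash` name, the `content` column, the `NUM_PERM` value, and the use of `datasketch` are assumptions for the sketch, not necessarily this PR's exact code:

```python
from datasketch import MinHash

NUM_PERM = 256  # assumption: number of hash permutations per signature

def compute_minhash(example):
    # build a MinHash signature from the set of whitespace-separated tokens
    m = MinHash(num_perm=NUM_PERM)
    for token in set(example["content"].split()):
        m.update(token.encode("utf-8"))
    # store the signature as a plain list so it fits in an Arrow column
    example["minhash"] = m.hashvalues.tolist()
    return example

ds = ds.map(compute_minhash, num_proc=16)
```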
Hi @lvwerra I agree with you and will do that. The overall code is running now; here are the next steps:

I will probably do the deduplication of the validation set in another PR.
Hi @lvwerra there is one decision we need to make, then the PR will be ready to review.

In the previous implementation, a queue was used while adding entries to the MinHash index. It would be difficult to do the same with `dataset.map`, so the `dataset.map` implementation will likely be almost twice as slow (to be confirmed ...).
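To make the sequential constraint concrete, here is a hedged sketch (using `datasketch`, not necessarily the PR's exact index structure) of the step that resists a plain `map`: each insertion first queries and then mutates one shared LSH index, so examples must be processed in order:

```python
from datasketch import MinHash, MinHashLSH

# threshold and num_perm values are assumptions for illustration
lsh = MinHashLSH(threshold=0.85, num_perm=256)

# signatures: assumed iterable of per-example MinHash objects
for idx, signature in enumerate(signatures):
    duplicates = lsh.query(signature)  # candidates already in the index
    lsh.insert(idx, signature)         # mutates shared state, hence sequential
```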
Here are some statistics and time performance data on the dataset lvwerra/codeparrot-clean. Original dataset size: 5361373.

Please see the next message for an update.
`multipro_find_extremes` is done with multiprocessing! This PR is ready for review. Original dataset size: 5361373.
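As a rough, hedged illustration of how the extremes step can be parallelized per cluster (the PR's `multipro_find_extremes` with shared objects likely differs in detail; `duplicate_clusters` and the threshold are assumed inputs):

```python
import multiprocessing as mp

JACCARD_THRESHOLD = 0.85  # assumption: same threshold as the clustering step

def find_cluster_extremes(cluster):
    """Keep one representative per group of near-duplicates inside a cluster.

    `cluster` is assumed to be a list of datasketch MinHash signatures.
    """
    extremes = []
    for signature in cluster:
        # keep this element only if no already-kept element is too similar
        if all(signature.jaccard(kept) < JACCARD_THRESHOLD for kept in extremes):
            extremes.append(signature)
    return extremes

if __name__ == "__main__":
    # duplicate_clusters: assumed list of clusters from the indexing step
    with mp.Pool(processes=8) as pool:
        extremes_per_cluster = pool.map(find_cluster_extremes, duplicate_clusters)
```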
The comment below refers to this snippet from `preprocessing.py`:

```python
ds = ds.map(preprocess, num_proc=args.num_workers)
print(f"Time to preprocess dataset: {time.time()-t_start:.2f}")

# Deduplicate hashes
```
For now we're running the script on `codeparrot-clean`, where there are no exact duplicates, but maybe we can still add an argument to choose whether we want to do exact deduplication or near-deduplication? @lvwerra
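A minimal sketch of such an argument (the flag names here are hypothetical; the merged script may use different ones):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--near_deduplication", action="store_true",
    help="run MinHash near-deduplication in addition to exact deduplication",
)
parser.add_argument(
    "--jaccard_threshold", type=float, default=0.85,
    help="Jaccard similarity above which two files count as near-duplicates",
)
args = parser.parse_args()
```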
Co-authored-by: Leandro von Werra <[email protected]> Co-authored-by: Loubna Ben Allal <[email protected]>
Thanks Jia, the PR looks good, just a small comment about the README.
Co-authored-by: Loubna Ben Allal <[email protected]>
We are good to go, I welcome your thoughts @lvwerra.
Looks good to me - just a few minor comments.
Co-authored-by: Leandro von Werra <[email protected]>
[CodeParrot] Near-deduplication with jaccard similarity (#17054)

* deduplication draft
* update style
* update style test
* dummy test main
* rename modules
* rename functions
* return extremes in deduplicate_clusters
* update style
* cast str for gzip
* update doc string
* time processing
* use dataset map to compute minhash
* fill value for short token
* remove da map method
* update style
* use share object to multiprocess
* update style
* use f-string and minor fix

Co-authored-by: Leandro von Werra <[email protected]>
Co-authored-by: Loubna Ben Allal <[email protected]>

* update style
* use module parameters
* change ds_dedup to ds_filter
* save ds_dedup
* mv test to script tests
* make jaccard threshold a parameter of deduplicate_dataset
* update style
* add doc strings
* update style
* add doc string for DuplicationIndex
* save files into data dir
* update readme
* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Loubna Ben Allal <[email protected]>

* make near deduplication optional
* move near deduplication in README
* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <[email protected]>

* use f string

Co-authored-by: Leandro von Werra <[email protected]>
Co-authored-by: Loubna Ben Allal <[email protected]>
What does this PR do?
This PR addresses the code duplication issue described in this thread:
https://twitter.com/miltos1/status/1497126435261083649?s=20&t=v5-vwaEtXLrgZ_GuZHrPKQ

Run the code

The function runs in 2:30 (make_duplicate_clusters) + 1:30 (find_extremes) on an 8-core VM.
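For reference, a hedged usage sketch of the entry point this PR adds; the exact signature and return values may differ from the merged `minhash_deduplication.py` (the tuple return and `jaccard_threshold` keyword are assumptions, though the commit history does make the Jaccard threshold a parameter of `deduplicate_dataset`):

```python
from datasets import load_dataset
from minhash_deduplication import deduplicate_dataset

ds = load_dataset("lvwerra/codeparrot-clean", split="train")
# assumed to return the filtered dataset plus the near-duplicate clusters
ds_dedup, duplicate_clusters = deduplicate_dataset(ds, jaccard_threshold=0.85)
print(f"Kept {len(ds_dedup)} of {len(ds)} examples after near-deduplication")
```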