
[CodeParrot] Near-deduplication with jaccard similarity #17054

Merged: 38 commits into huggingface:main on Jun 21, 2022

Conversation

liyongsea (Contributor) commented May 2, 2022

What does this PR do?

This PR addresses the code duplication issue described in this thread:
https://twitter.com/miltos1/status/1497126435261083649?s=20&t=v5-vwaEtXLrgZ_GuZHrPKQ

Run the code:

from datasets import load_dataset
from minhash_deduplication import deduplicate_dataset
ds = load_dataset("lvwerra/codeparrot-clean", split="train")
ds_dedup, duplicate_clusters = deduplicate_dataset(ds)

The function runs in 2:30 (make_duplicate_clusters) + 1:30 (find_extremes) on an 8-core VM.

Original dataset size: 5361373
Duplicate cluster: 757944
Files in duplicate cluster: 2677040
Unique files in duplicate cluster: 911947
Filtered dataset size: 3596280

@liyongsea liyongsea changed the title from "Deduplication with jaccard similarity" to "[CodeParrot] Deduplication with jaccard similarity" on May 2, 2022
HuggingFaceDocBuilderDev commented May 2, 2022

The documentation is not available anymore as the PR was closed or merged.

lvwerra (Member) left a comment


Hi @liyongsea

Thanks for adding deduplication, that's an exciting addition! In terms of structure, I think it would be good to include this in the main preprocessing.py. Since it requires quite a bit of code, we could probably leave most of the code in minhash_deduplication.py and then do something like the following in preprocessing.py:

from minhash_deduplication import deduplicate_dataset

# other preprocessing steps

ds = deduplicate_dataset(ds, arg1, arg2, ...)

# save dataset and push to hub

I like that you'll use dataset.map() for the parallelization - it matches well with the rest of the codebase. You can probably also do minhash_iter with a simple map.
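
For illustration, a map-based MinHash computation could look roughly like this; it is a sketch assuming the dataset's content column and the datasketch library, returning plain integers so the signature can be stored as a regular column, and is not necessarily the final implementation of this PR:

from datasets import load_dataset
from datasketch import MinHash

NUM_PERM = 256  # illustrative value

def compute_min_hash(example):
    # Signature over the file's set of whitespace tokens; two signatures
    # estimate the Jaccard similarity of the underlying token sets.
    m = MinHash(num_perm=NUM_PERM)
    for token in set(example["content"].split()):
        m.update(token.encode("utf-8"))
    return {"hashvalues": m.hashvalues.tolist()}

ds = load_dataset("lvwerra/codeparrot-clean", split="train")
ds = ds.map(compute_min_hash, num_proc=8)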

liyongsea (Contributor, Author) commented May 6, 2022

Hi @lvwerra, I agree with you. I will do that. The overall code is running now; here are the next steps:

  • refactor the code to be used in preprocess.py and clean up
  • document statistics and performance data in the PR
  • use dataset.map to compute minhash

I will probably do the deduplication of the validation set in another PR.
One question: does dataset.map put the whole dataset in RAM? I imagine it is not a problem since preprocess.py already does so.

liyongsea (Contributor, Author) commented May 7, 2022

Hi @lvwerra, there is one decision we need to make, then the PR will be ready to review.
As I mentioned before, we could use dataset.map to compute the minhash. However, there are two steps in the deduplication:

  • compute the minhash for each code file
  • add it into the MinHashLSH index (this cannot be parallelized)

In the previous implementation, a queue is used while adding into the MinHashLSH. It would be difficult to do the same using dataset.map, so the dataset.map implementation would be almost twice as slow (to be confirmed...).
I might prefer the dataset.map solution, which makes the code easier to read.
In the end I chose the initial implementation, which reduces the computation time by half.
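
For illustration, the second, non-parallelizable step looks roughly like this: each file is queried against the MinHashLSH index before being inserted, so every query depends on all previous insertions. A minimal sketch with illustrative parameters, not the exact code of this PR:

from datasketch import MinHash, MinHashLSH

def get_min_hash(content, num_perm=256):
    # MinHash signature over the file's set of whitespace tokens.
    m = MinHash(num_perm=num_perm)
    for token in set(content.split()):
        m.update(token.encode("utf-8"))
    return m

files = ["def add(a, b):\n    return a + b",
         "def add(x, y):\n    return x + y",
         "print('hello world')"]

lsh = MinHashLSH(threshold=0.85, num_perm=256)
duplicate_clusters = []

# Sequential by construction: the query for file i must see every file < i
# already in the index, so only the per-file MinHash computation can be
# parallelized, not this insertion loop.
for idx, content in enumerate(files):
    min_hash = get_min_hash(content)
    close_duplicates = lsh.query(min_hash)
    lsh.insert(idx, min_hash)
    if close_duplicates:
        duplicate_clusters.append((idx, close_duplicates))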

@liyongsea liyongsea force-pushed the codeparrot_deduplication branch 2 times, most recently from 116801b to a400852 on May 11, 2022 19:55
liyongsea (Contributor, Author) commented May 11, 2022

Here are some statistics and time performance data on the dataset lvwerra/codeparrot-clean:

Total execution time ~13h: 2:30:00 for make_duplicate_clusters, 11:00:00 for find_cluster_extremes

Original dataset size: 5361373
Duplicate cluster: 757938
Files in duplicate cluster: 2677039
Unique files in duplicate cluster: 940857
Filtered dataset size: 3625191

I think the code is ready for review. If you need to generate a dataset, you can go ahead. I might still need a few more days to figure out how to do find_cluster_extremes better.

Please see the next message for an update.

liyongsea (Contributor, Author) commented:

multipro_find_extremes is now done with multiprocessing! This PR is ready for review.
Total execution time ~3h: 2:30:00 for make_duplicate_clusters, 1:00:00 for multipro_find_extremes

Original dataset size: 5361373
Duplicate cluster: 757938
Files in duplicate cluster: 2677039
Unique files in duplicate cluster: 940857
Filtered dataset size: 3625191
@lvwerra, when reviewing, pay particular attention to:

  • Here I use a global parameter to be able to do multiprocessing in an efficient way
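
The pattern being referred to is, roughly, keeping the large read-only object in a module-level global so that worker processes can read it without it being pickled for every task. A sketch with illustrative names, not the exact code of minhash_deduplication.py:

import multiprocessing as mp

_shared_dataset = None  # module-level global, set once per worker process

def _init_worker(dataset):
    # Runs once in each worker; stores the big read-only object globally
    # instead of shipping it along with every individual task.
    global _shared_dataset
    _shared_dataset = dataset

def _find_cluster_extremes(cluster):
    # Each task only receives a small cluster (a list of indices) and looks the
    # file contents up in the shared dataset; the real function would go on to
    # pick representative "extremes" from the cluster.
    return [_shared_dataset[idx]["content"] for idx in cluster]

def multipro_find_extremes(dataset, clusters, num_workers=8):
    with mp.Pool(num_workers, initializer=_init_worker, initargs=(dataset,)) as pool:
        return pool.map(_find_cluster_extremes, clusters)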

ds = ds.map(preprocess, num_proc=args.num_workers)
print(f"Time to preprocess dataset: {time.time()-t_start:.2f}")

# Deduplicate hashes
A contributor commented on the diff above:

For now we're running the script on codeparrot-clean where there are no exact duplicates, but maybe we can still add an argument to choose whether we want to do exact deduplication or near-deduplication? @lvwerra
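
One possible shape for that option, as a sketch; the flag name here is hypothetical, not necessarily what the script ended up using:

import argparse

parser = argparse.ArgumentParser(description="CodeParrot preprocessing")
# Hypothetical flag: keep exact deduplication as the default behaviour and make
# the MinHash/Jaccard-based near-deduplication opt-in.
parser.add_argument(
    "--near-deduplication",
    action="store_true",
    help="Also remove near-duplicate files using MinHash and Jaccard similarity.",
)
args = parser.parse_args()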

loubnabnl (Contributor) left a comment


Thanks Jia, the PR looks good, just a small comment about the README.

examples/research_projects/codeparrot/README.md (outdated; resolved)
liyongsea (Contributor, Author) commented:

We are good to go, I welcome your thoughts @lvwerra.
I will try to run some last tests.

lvwerra (Member) left a comment


Looks good to me - just a few minor comments.

@liyongsea liyongsea changed the title from "[CodeParrot] Deduplication with jaccard similarity" to "[CodeParrot] Near-deduplication with jaccard similarity" on Jun 18, 2022
@lvwerra lvwerra merged commit da2bd2a into huggingface:main Jun 21, 2022
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request Jun 25, 2022
…17054)

* deduplication draft

* update style

* update style test

* dummy test main

* rename modules

* rename functions

* return extremes in deduplicate_clusters

* update style

* cast str for gzip

* update doc string

* time processing

* use dataset map to compute minhash

* fill value for short token

* remove da map method

* update style

* use share object to multiprocess

* update style

* use f-string and minor fix

Co-authored-by: Leandro von Werra <[email protected]>
Co-authored-by: Loubna Ben Allal <[email protected]>

* update style

* use module parameters

* change ds_dedup to ds_filter

* save ds_dedup

* mv test to script tests

* make jaccard threshold a parameter of deduplicate_dataset

* update style

* add doc strings

* update style

* add doc string for DuplicationIndex

* save files into data dir

* update readme

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Loubna Ben Allal <[email protected]>

* make near deduplication optional

* move near deduplication in README

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <[email protected]>

* use f string

Co-authored-by: Leandro von Werra <[email protected]>
Co-authored-by: Loubna Ben Allal <[email protected]>
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request Jun 29, 2022 (same commit message as above).