
[CodeParrot] Near-deduplication with jaccard similarity #17054

Merged: 38 commits into huggingface:main on Jun 21, 2022

Conversation

liyongsea (Contributor) commented May 2, 2022

What does this PR do?

This PR addresses the code duplication issue described in this thread:
https://twitter.com/miltos1/status/1497126435261083649?s=20&t=v5-vwaEtXLrgZ_GuZHrPKQ

Run the code:

from datasets import load_dataset
from minhash_deduplication import deduplicate_dataset
ds = load_dataset("lvwerra/codeparrot-clean", split="train")
ds_dedup, duplicate_clusters = deduplicate_dataset(ds)

The function runs in 2:30 (make_duplicate_clusters) + 1:30 (find_extremes) on an 8-core VM.

Original dataset size: 5361373
Duplicate cluster: 757944
Files in duplicate cluster: 2677040
Unique files in duplicate cluster: 911947
Filtered dataset size: 3596280

@liyongsea liyongsea changed the title from "Deduplication with jaccard similarity" to "[CodeParrot] Deduplication with jaccard similarity" on May 2, 2022
HuggingFaceDocBuilderDev commented May 2, 2022

The documentation is not available anymore as the PR was closed or merged.

lvwerra (Member) left a comment


Hi @liyongsea

Thanks for adding deduplication, that's an exciting addition! In terms of structure, I think it would be good to include this in the main preprocessing.py. Since it requires quite a bit of code, we could probably leave most of the code in minhash_deduplication.py and then do something like the following in preprocessing.py:

from minhash_deduplication import deduplicate_dataset

# other preprocessing steps

ds = deduplicate_dataset(ds, arg1, arg2, ...)

# save dataset and push to hub

I like that you'll use dataset.map() for the parallelization - it matches well with the rest of the codebase. You can probably also do minhash_iter with a simple map.
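
For illustration, a map-based MinHash computation could look roughly like this; it is a sketch assuming the dataset's content column and the datasketch library, returning plain integers so the signature can be stored as a regular column, and is not necessarily the final implementation of this PR:

from datasets import load_dataset
from datasketch import MinHash

NUM_PERM = 256  # illustrative value

def compute_min_hash(example):
    # Signature over the file's set of whitespace tokens; two signatures
    # estimate the Jaccard similarity of the underlying token sets.
    m = MinHash(num_perm=NUM_PERM)
    for token in set(example["content"].split()):
        m.update(token.encode("utf-8"))
    return {"hashvalues": m.hashvalues.tolist()}

ds = load_dataset("lvwerra/codeparrot-clean", split="train")
ds = ds.map(compute_min_hash, num_proc=8)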

liyongsea (Contributor, Author) commented May 6, 2022

Hi @lvwerra, I agree with you. I will do that. The overall code is running now; here are the next steps:

  • refactor the code to be used in preprocess.py and clean up
  • document statistics and performance data in the PR
  • use dataset.map to compute minhash

I will probably do the deduplication of the validation set in another PR.
One question: does dataset.map put the whole dataset in RAM? I imagine it is not a problem since preprocess.py already does so.

liyongsea (Contributor, Author) commented May 7, 2022

Hi @lvwerra, there is one decision we need to make, then the PR will be ready to review.
As I mentioned before, we could use dataset.map to compute the minhash. However, there are two steps in the deduplication:

  • compute the minhash for each code file
  • add it into the MinHashLSH index (this cannot be parallelized)

In the previous implementation, a queue is used while adding into the MinHashLSH. It would be difficult to do the same using dataset.map, so the dataset.map implementation would be almost twice as slow (to be confirmed...).
I might prefer the dataset.map solution, which makes the code easier to read.
In the end I chose the initial implementation, which reduces the computation time by half.
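
For illustration, the second, non-parallelizable step looks roughly like this: each file is queried against the MinHashLSH index before being inserted, so every query depends on all previous insertions. A minimal sketch with illustrative parameters, not the exact code of this PR:

from datasketch import MinHash, MinHashLSH

def get_min_hash(content, num_perm=256):
    # MinHash signature over the file's set of whitespace tokens.
    m = MinHash(num_perm=num_perm)
    for token in set(content.split()):
        m.update(token.encode("utf-8"))
    return m

files = ["def add(a, b):\n    return a + b",
         "def add(x, y):\n    return x + y",
         "print('hello world')"]

lsh = MinHashLSH(threshold=0.85, num_perm=256)
duplicate_clusters = []

# Sequential by construction: the query for file i must see every file < i
# already in the index, so only the per-file MinHash computation can be
# parallelized, not this insertion loop.
for idx, content in enumerate(files):
    min_hash = get_min_hash(content)
    close_duplicates = lsh.query(min_hash)
    lsh.insert(idx, min_hash)
    if close_duplicates:
        duplicate_clusters.append((idx, close_duplicates))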

@liyongsea liyongsea force-pushed the codeparrot_deduplication branch 2 times, most recently from 116801b to a400852 on May 11, 2022 19:55
liyongsea (Contributor, Author) commented May 11, 2022

Here are some statistics and time performance data on the dataset lvwerra/codeparrot-clean:

Total execution time ~13h: 2:30:00 for make_duplicate_clusters, 11:00:00 for find_cluster_extremes

Original dataset size: 5361373
Duplicate cluster: 757938
Files in duplicate cluster: 2677039
Unique files in duplicate cluster: 940857
Filtered dataset size: 3625191

I think the code is ready for review. If you need to generate a dataset, you can go ahead. I might still need a few more days to figure out how to do find_cluster_extremes better.

Please see the next message for an update.

liyongsea (Contributor, Author) commented:

multipro_find_extremes is now done with multiprocessing! This PR is ready for review.
Total execution time ~3h: 2:30:00 for make_duplicate_clusters, 1:00:00 for multipro_find_extremes

Original dataset size: 5361373
Duplicate cluster: 757938
Files in duplicate cluster: 2677039
Unique files in duplicate cluster: 940857
Filtered dataset size: 3625191
@lvwerra, when reviewing, pay particular attention to:

  • Here I use a global parameter to be able to do multiprocessing in an efficient way
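
The pattern being referred to is, roughly, keeping the large read-only object in a module-level global so that worker processes can read it without it being pickled for every task. A sketch with illustrative names, not the exact code of minhash_deduplication.py:

import multiprocessing as mp

_shared_dataset = None  # module-level global, set once per worker process

def _init_worker(dataset):
    # Runs once in each worker; stores the big read-only object globally
    # instead of shipping it along with every individual task.
    global _shared_dataset
    _shared_dataset = dataset

def _find_cluster_extremes(cluster):
    # Each task only receives a small cluster (a list of indices) and looks the
    # file contents up in the shared dataset; the real function would go on to
    # pick representative "extremes" from the cluster.
    return [_shared_dataset[idx]["content"] for idx in cluster]

def multipro_find_extremes(dataset, clusters, num_workers=8):
    with mp.Pool(num_workers, initializer=_init_worker, initargs=(dataset,)) as pool:
        return pool.map(_find_cluster_extremes, clusters)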

ds = ds.map(preprocess, num_proc=args.num_workers)
print(f"Time to preprocess dataset: {time.time()-t_start:.2f}")

# Deduplicate hashes
A contributor commented on the diff above:

For now we're running the script on codeparrot-clean where there are no exact duplicates, but maybe we can still add an argument to choose whether we want to do exact deduplication or near-deduplication? @lvwerra
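
One possible shape for that option, as a sketch; the flag name here is hypothetical, not necessarily what the script ended up using:

import argparse

parser = argparse.ArgumentParser(description="CodeParrot preprocessing")
# Hypothetical flag: keep exact deduplication as the default behaviour and make
# the MinHash/Jaccard-based near-deduplication opt-in.
parser.add_argument(
    "--near-deduplication",
    action="store_true",
    help="Also remove near-duplicate files using MinHash and Jaccard similarity.",
)
args = parser.parse_args()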

loubnabnl (Contributor) left a comment


Thanks Jia, the PR looks good, just a small comment about the README.

examples/research_projects/codeparrot/README.md (outdated; resolved)
liyongsea (Contributor, Author) commented:

We are good to go, I welcome your thoughts @lvwerra.
I will try to run some last tests.

lvwerra (Member) left a comment


Looks good to me - just a few minor comments.

@liyongsea liyongsea changed the title from "[CodeParrot] Deduplication with jaccard similarity" to "[CodeParrot] Near-deduplication with jaccard similarity" on Jun 18, 2022
@lvwerra lvwerra merged commit da2bd2a into huggingface:main Jun 21, 2022
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request Jun 25, 2022
…17054)

* deduplication draft

* update style

* update style test

* dummy test main

* rename modules

* rename functions

* return extremes in deduplicate_clusters

* update style

* cast str for gzip

* update doc string

* time processing

* use dataset map to compute minhash

* fill value for short token

* remove da map method

* update style

* use share object to multiprocess

* update style

* use f-string and minor fix

Co-authored-by: Leandro von Werra <[email protected]>
Co-authored-by: Loubna Ben Allal <[email protected]>

* update style

* use module parameters

* change ds_dedup to ds_filter

* save ds_dedup

* mv test to script tests

* make jaccard threshold a parameter of deduplicate_dataset

* update style

* add doc strings

* update style

* add doc string for DuplicationIndex

* save files into data dir

* update readme

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Loubna Ben Allal <[email protected]>

* make near deduplication optional

* move near deduplication in README

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <[email protected]>

* use f string

Co-authored-by: Leandro von Werra <[email protected]>
Co-authored-by: Loubna Ben Allal <[email protected]>
younesbelkada pushed a commit to younesbelkada/transformers that referenced this pull request Jun 29, 2022 (same commit message as above).