Enable Sem-dedup #130
* Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]> * Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]> * Update index.rst Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <[email protected]> * Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
Thanks a ton @VibhuJawa, just a couple of nits and things I think you missed the first time around.
I have tested the most recent PR (using 10 data files with 12 clusters). The result is consistent with our original result. Thanks, Vibhu! This is the command:
python semdedup_example.py --input-data-dir /ads_ds3/data/SemDeDup_BenchMark/datasets/c4/realnewslike/modified --config-file configs_cf.yml
The content of configs_cf.yml:
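For readers unfamiliar with the invocation, here is a minimal sketch (not the actual semdedup_example.py) of how a script could accept the two flags used in the command above. Only the flag names `--input-data-dir` and `--config-file` come from the comment; the function name `parse_args`, the help strings, and the example path are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical front end; the real script may define more options.
    parser = argparse.ArgumentParser(
        description="Illustrative CLI mirroring the flags in the test command"
    )
    parser.add_argument(
        "--input-data-dir",
        required=True,
        help="Directory containing the input data files",
    )
    parser.add_argument(
        "--config-file",
        required=True,
        help="Path to the YAML configuration file",
    )
    return parser.parse_args(argv)


# Example invocation with placeholder values (the path is illustrative only).
args = parse_args(
    ["--input-data-dir", "/path/to/data", "--config-file", "configs_cf.yml"]
)
print(args.input_data_dir, args.config_file)
```

argparse converts the dashes in each flag name to underscores, which is why the values are read back as `args.input_data_dir` and `args.config_file`.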
Thanks so much for this, @faywang123. Appreciate all the help. @ayushdg, can we take @faywang123's test above and add it to our testing?
Two nits then we're set.
After a final review and test, the PR looks good to me. Thanks, @VibhuJawa, for all the hard work!
@ryantwolf, addressed the nits. Let me know.
Incredible work, so excited to have this be a part of NeMo Curator
Not blocking but a couple of other suggestions:
- Adding an optional SemDedup import to the top level modules/__init__.py file for gpu only environments. Allowing users to do something like from nemo_curator import SemDedup
- Adding semantic deduplication in the list of features both in the README.md as well as a page in docs/user-guide
Done.
Added readme.
* Applying SEO Best Pratices (NVIDIA#104) * Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]> * Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]> * Update index.rst Setting all RST files to lowercase names. 
Signed-off-by: Andrew Schilling <[email protected]> * Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * Shuffle CC result on group before writing out (NVIDIA#110) Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst (NVIDIA#113) Added links to tutorials Signed-off-by: jgerh <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * first commit Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * mv under modules dir Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * first commit Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * mv under modules dir Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> * first commit Signed-off-by: Vibhu Jawa <[email protected]> * mv under modules dir Signed-off-by: Vibhu Jawa <[email protected]> * embed by cluster saved Signed-off-by: Vibhu Jawa <[email protected]> * id map script Signed-off-by: Vibhu Jawa <[email protected]> * test commit Signed-off-by: Vibhu Jawa <[email protected]> * add id map script Signed-off-by: Vibhu Jawa <[email protected]> * Cleanup compute_embeddings_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]> * Cleanup compute_embeddings_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]> * Pre-commit style fixes Signed-off-by: Vibhu Jawa <[email protected]> * clustering_dask_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]> * Minor clean up to sort_clusters_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]> * cleanup semdedup_crossfit Signed-off-by: Vibhu Jawa <[email protected]> * Remove undo changes 
Signed-off-by: Vibhu Jawa <[email protected]> * Remove rename changes Signed-off-by: Vibhu Jawa <[email protected]> * Fix rename Signed-off-by: Vibhu Jawa <[email protected]> * Readme formatting Signed-off-by: Vibhu Jawa <[email protected]> * add dask to semdedup_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]> * README.md updates Signed-off-by: Vibhu Jawa <[email protected]> * README.md updates Signed-off-by: Vibhu Jawa <[email protected]> * README.md updates Signed-off-by: Vibhu Jawa <[email protected]> * README.md updates Signed-off-by: Vibhu Jawa <[email protected]> * README.md updates Signed-off-by: Vibhu Jawa <[email protected]> * configure max memory using a cli Signed-off-by: Vibhu Jawa <[email protected]> * Dumb id results to parquet Signed-off-by: Vibhu Jawa <[email protected]> * Embedding fixes Signed-off-by: Vibhu Jawa <[email protected]> * README.md updates Signed-off-by: Vibhu Jawa <[email protected]> * Working end to end Signed-off-by: Vibhu Jawa <[email protected]> * Minor yaml fixes Signed-off-by: Vibhu Jawa <[email protected]> * Undo changes to index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Update .pre-commit-config.yaml Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Vibhu Jawa <[email protected]> * Update fuzzy_dedup.py Signed-off-by: Vibhu Jawa <[email protected]> * Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Add end to end script in readme.md Signed-off-by: Vibhu Jawa <[email protected]> * Add type hints Signed-off-by: Vibhu Jawa <[email protected]> * Use dask for sort_clusters Signed-off-by: Vibhu Jawa <[email protected]> * Make sort_clusters work on 
MNMG scales Signed-off-by: Vibhu Jawa <[email protected]> * Cleaned up dask shutdown Signed-off-by: Vibhu Jawa <[email protected]> * Decrease noise in E2E scripts Signed-off-by: Vibhu Jawa <[email protected]> * Clean up scripts Signed-off-by: Vibhu Jawa <[email protected]> * Fix scripts/end_to_end_script.sh Signed-off-by: Vibhu Jawa <[email protected]> * Some more cleanup Signed-off-by: Vibhu Jawa <[email protected]> * Add copyright Signed-off-by: Vibhu Jawa <[email protected]> * Fix README.md Signed-off-by: Vibhu Jawa <[email protected]> * Address reviews Signed-off-by: Vibhu Jawa <[email protected]> * Make work with a SemDedupConfig Signed-off-by: Vibhu Jawa <[email protected]> * Make work with SemDedupConfig Signed-off-by: Vibhu Jawa <[email protected]> * Move to nemo-curator's logger Signed-off-by: Vibhu Jawa <[email protected]> * Semdedup-extract_dedup_data.py Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Applying SEO Best Pratices (NVIDIA#104) * Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]> * Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to 
personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]> * Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]> * Update index.rst Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <[email protected]> * Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Fix bad merge Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Update index.rst Signed-off-by: Vibhu Jawa <[email protected]> * Add Module for embedding+clustering Signed-off-by: Vibhu Jawa <[email protected]> * Add sorting to clustering Signed-off-by: Vibhu Jawa <[email protected]> * Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]> * Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]> * Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]> * Fix Readme.md Signed-off-by: Vibhu Jawa <[email protected]> * Add a environment variable to silence HF warnings Signed-off-by: Vibhu Jawa <[email protected]> * dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]> * dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]> * dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]> * Make config a flat file based on reviews Signed-off-by: Vibhu Jawa <[email protected]> * Add docstrings Signed-off-by: Vibhu Jawa <[email protected]> * Fix argparse and seed function 
Signed-off-by: Vibhu Jawa <[email protected]> * Use argparse to read config Signed-off-by: Vibhu Jawa <[email protected]> * Move around config files Signed-off-by: Vibhu Jawa <[email protected]> * Move around config files Signed-off-by: Vibhu Jawa <[email protected]> * Move around config files Signed-off-by: Vibhu Jawa <[email protected]> * Remove end_to_end_script.sh Signed-off-by: Vibhu Jawa <[email protected]> * Append Readme Signed-off-by: Vibhu Jawa <[email protected]> * Address Reviews Signed-off-by: Vibhu Jawa <[email protected]> * Change config Signed-off-by: Vibhu Jawa <[email protected]> * Make embedding creation optionally lazy Signed-off-by: Vibhu Jawa <[email protected]> * fix docstring Signed-off-by: Vibhu Jawa <[email protected]> * Address Reviews and docstrings Signed-off-by: Vibhu Jawa <[email protected]> * Address Reviews and make eps_thresholds a list of values Signed-off-by: Vibhu Jawa <[email protected]> * Minor import fix Signed-off-by: Vibhu Jawa <[email protected]> * Empty Commit Signed-off-by: Vibhu Jawa <[email protected]> * Add modules to __init__ and README.md Signed-off-by: Vibhu Jawa <[email protected]> * Fix init Signed-off-by: Vibhu Jawa <[email protected]> * Move comment Signed-off-by: Vibhu Jawa <[email protected]> * Empty commit to restart CI (which failed due to a download issue) Signed-off-by: Vibhu Jawa <[email protected]> * Empty commit to restart CI (which failed due to a download issue) Signed-off-by: Vibhu Jawa <[email protected]> --------- Signed-off-by: Andrew Schilling <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]> Signed-off-by: jgerh <[email protected]> Signed-off-by: avinashvem <[email protected]> Co-authored-by: Andrew Schilling <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: jgerh <[email protected]> Co-authored-by: avinashvem <[email protected]>
Description
This PR builds on top of #118 and adds the following features on top of it:
sort_clusters.py and semdedup.py
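As background on what a semantic deduplication step does inside a single cluster, here is a simplified, pure-Python sketch. It is not the implementation in semdedup.py: the function names `cosine` and `dedup_cluster` are hypothetical, the greedy keep/drop rule is a simplification, and embeddings are assumed to be non-zero vectors. The idea is that an item is dropped when its cosine similarity to an already kept item reaches `1 - eps`, which matches the role of the eps thresholds mentioned in the commit history.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dedup_cluster(embeddings, eps):
    """Return indices of the items kept within one cluster.

    Greedy simplification: walk the items in order and keep an item only
    if it is less than (1 - eps) similar to every item kept so far.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < 1.0 - eps for j in kept):
            kept.append(i)
    return kept


# Toy cluster: the first two vectors are near-duplicates, the third is distinct.
cluster = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
for eps in [0.01, 0.00001]:  # several thresholds can be evaluated in one pass
    print(eps, dedup_cluster(cluster, eps))
```

A larger eps removes more items because it lowers the similarity needed to count as a duplicate; the order in which items are visited decides which member of a duplicate pair survives.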
Checklist