
[BUG] Fuzzy deduplication fails on datasets with no duplicates #67

Open
Maghoumi opened this issue May 15, 2024 · 4 comments · May be fixed by #326
Labels: bug (Something isn't working)
@Maghoumi (Collaborator)

Describe the bug

Running fuzzy deduplication on a dataset that contains no duplicates crashes with a FileNotFoundError.

Steps/Code to reproduce bug

  1. Clone the repo
  2. Run the TinyStories tutorial
  3. Run examples/fuzzy_deduplication.py on the dataset under tutorials/tinystories/data/jsonl/val/

Expected behavior

The code should not crash.
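To illustrate the expected behavior, here is a hedged sketch of a caller-side convention (the function name and the shape of the duplicates result are hypothetical, not the actual NeMo Curator API): an empty or None deduplication result should mean "nothing to remove", not a crash.

```python
# Hypothetical sketch: treat a None/empty duplicates result as
# "no duplicates found" and return the dataset unchanged.
def remove_duplicates(documents, duplicate_ids):
    if not duplicate_ids:  # None or empty -> nothing to remove
        return documents
    return [d for d in documents if d["id"] not in duplicate_ids]
```

Under this convention, a dataset with no duplicates passes through untouched instead of raising an error partway through the pipeline.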

Environment details
Using the NVIDIA pytorch:24.04-py3 container image

@Maghoumi added the bug label on May 15, 2024
@glam621 (Collaborator)

glam621 commented Jun 24, 2024

Assigning it to @ayushdg

@Maghoumi (Collaborator, Author)

@ayushdg I heard from @praateekmahajan that he's not seeing this issue on his end. I just pulled the latest main branch to confirm, but I still hit the same error:

$ python examples/fuzzy_deduplication.py --device gpu
Reading 3 files
/NeMo-Curator/nemo_curator/modules/config.py:91: UserWarning: Identifying false positives during the Minhash deduplication is computationally expensive. For improved performance consider setting this to False
  warnings.warn(
Stage1: Starting Minhash + LSH computation
/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py:178: UserWarning: Output path ./fuzzy_cache/_minhashes.parquet already exists and will be overwritten
  warnings.warn(
Traceback (most recent call last):
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 141, in wrapper
    return func(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py", line 529, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py", line 536, in read_metadata
    dataset_info = cls._collect_dataset_info(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py", line 1051, in _collect_dataset_info
    ds = pa_ds.dataset(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/pyarrow/dataset.py", line 785, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/pyarrow/dataset.py", line 463, in _filesystem_dataset
    fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/pyarrow/dataset.py", line 382, in _ensure_multiple_sources
    raise FileNotFoundError(info.path)
FileNotFoundError: /NeMo-Curator/fuzzy_cache/_buckets.parquet

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 141, in wrapper
    return func(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cudf/backends.py", line 603, in read_parquet
    return _default_backend(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cudf/backends.py", line 514, in _default_backend
    return func(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 143, in wrapper
    raise type(e)(
FileNotFoundError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: /NeMo-Curator/fuzzy_cache/_buckets.parquet

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/NeMo-Curator/examples/fuzzy_deduplication.py", line 110, in <module>
    main(attach_args().parse_args())
  File "/NeMo-Curator/examples/fuzzy_deduplication.py", line 82, in main
    duplicates = fuzzy_dup(dataset=input_dataset)
  File "/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 496, in __call__
    buckets_df = minhashLSH(dataset)
  File "/NeMo-Curator/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 385, in __call__
    buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cudf/__init__.py", line 38, in read_parquet
    return dd.read_parquet(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 143, in wrapper
    raise type(e)(
FileNotFoundError: An error occurred while calling the read_parquet method registered to the cudf backend.
Original Message: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: /NeMo-Curator/fuzzy_cache/_buckets.parquet

Same as before, it looks like an intermediate file (_buckets.parquet) is never written, so the subsequent read fails.

@ayushdg (Collaborator)

ayushdg commented Sep 11, 2024

There may have been some confusion: @praateekmahajan is looking into a different issue, one that cannot be reproduced. This particular issue is reproducible and still persists. Thanks for confirming, @Maghoumi!

@praateekmahajan (Collaborator)

@Maghoumi +1, sorry, my bad on the confusion. The TinyStories val set does trigger the error, and I ran into it as well. Hopefully we can resolve this one soon.
