
[BUG] Fuzzy deduplication fails on datasets with no duplicates #67

Open
Maghoumi opened this issue May 15, 2024 · 4 comments · May be fixed by #326
Labels: bug (Something isn't working)
@Maghoumi (Collaborator)

Describe the bug

Running fuzzy deduplication on a dataset that contains no duplicates crashes with a FileNotFoundError.

Steps/Code to reproduce bug

  1. Clone the repo
  2. Run the TinyStories tutorial
  3. Run examples/fuzzy_deduplication.py on the dataset under tutorials/tinystories/data/jsonl/val/

Expected behavior

The code should not crash.
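To illustrate the expected behavior, here is a hedged sketch of a caller-side convention (the function name and the shape of the duplicates result are hypothetical, not the actual NeMo Curator API): an empty or None deduplication result should mean "nothing to remove", not a crash.

```python
# Hypothetical sketch: treat a None/empty duplicates result as
# "no duplicates found" and return the dataset unchanged.
def remove_duplicates(documents, duplicate_ids):
    if not duplicate_ids:  # None or empty -> nothing to remove
        return documents
    return [d for d in documents if d["id"] not in duplicate_ids]
```

Under this convention, a dataset with no duplicates passes through untouched instead of raising an error partway through the pipeline.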

Environment details
Using the NVIDIA pytorch:24.04-py3 container image

@Maghoumi added the bug label on May 15, 2024
@glam621 (Collaborator)

glam621 commented Jun 24, 2024

Assigning it to @ayushdg

@Maghoumi (Collaborator, Author)

@ayushdg I heard from @praateekmahajan that he's not seeing this issue on his end. I just pulled the latest main branch to confirm, but I still hit the same error:

$ python examples/fuzzy_deduplication.py --device gpu
Reading 3 files
/NeMo-Curator/nemo_curator/modules/config.py:91: UserWarning: Identifying false positives during the Minhash deduplication is computationally expensive. For improved performance consider setting this to False
  warnings.warn(
Stage1: Starting Minhash + LSH computation
/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py:178: UserWarning: Output path ./fuzzy_cache/_minhashes.parquet already exists and will be overwritten
  warnings.warn(
Traceback (most recent call last):
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 141, in wrapper
    return func(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py", line 529, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py", line 536, in read_metadata
    dataset_info = cls._collect_dataset_info(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py", line 1051, in _collect_dataset_info
    ds = pa_ds.dataset(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/pyarrow/dataset.py", line 785, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/pyarrow/dataset.py", line 463, in _filesystem_dataset
    fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/pyarrow/dataset.py", line 382, in _ensure_multiple_sources
    raise FileNotFoundError(info.path)
FileNotFoundError: /NeMo-Curator/fuzzy_cache/_buckets.parquet

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 141, in wrapper
    return func(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cudf/backends.py", line 603, in read_parquet
    return _default_backend(
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cudf/backends.py", line 514, in _default_backend
    return func(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 143, in wrapper
    raise type(e)(
FileNotFoundError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: /NeMo-Curator/fuzzy_cache/_buckets.parquet

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/NeMo-Curator/examples/fuzzy_deduplication.py", line 110, in <module>
    main(attach_args().parse_args())
  File "/NeMo-Curator/examples/fuzzy_deduplication.py", line 82, in main
    duplicates = fuzzy_dup(dataset=input_dataset)
  File "/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 496, in __call__
    buckets_df = minhashLSH(dataset)
  File "/NeMo-Curator/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 385, in __call__
    buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cudf/__init__.py", line 38, in read_parquet
    return dd.read_parquet(*args, **kwargs)
  File "/NeMo-Curator/venv/lib/python3.10/site-packages/dask/backends.py", line 143, in wrapper
    raise type(e)(
FileNotFoundError: An error occurred while calling the read_parquet method registered to the cudf backend.
Original Message: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: /NeMo-Curator/fuzzy_cache/_buckets.parquet

Same as before, it looks like an intermediate file (_buckets.parquet) is never written, so the subsequent read fails.

@ayushdg (Collaborator)

ayushdg commented Sep 11, 2024

There may have been some confusion: @praateekmahajan is looking into a different issue, one that cannot be reproduced. This particular issue is reproducible and still persists. Thanks for confirming, @Maghoumi!

@praateekmahajan (Collaborator)

@Maghoumi +1, sorry, my bad on the confusion. The TinyStories val set does trigger the error, and I ran into it as well. Hopefully we can resolve this one soon.
