Graceful handling when no LSH duplicates found. #381

davzoku · 2024-11-19T06:22:43Z

In the current implementation, the __call__ method in nemo_curator/modules/fuzzy_dedup.py assumes that at least one LSH duplicate will be found and that the results will be saved as a parquet file. However, if the dataset is clean or too small to contain any fuzzy duplicates, no file is written, and the code throws an error when it tries to read the non-existent parquet file.

    def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
        df = dataset.df

        write_path = os.path.join(self.cache_dir, "_buckets.parquet")
        t0 = time.time()
        with performance_report_if_with_ts_suffix(self.profile_dir, f"lsh-profile"):
            self.lsh(write_path=write_path, df=df)
        self._logger.info(
            f"Time taken for LSH = {time.time() - t0}s and output written at {write_path}"
        )

        buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
        return DocumentDataset(buckets_df)

A simple enhancement would be to log a warning or otherwise handle this case gracefully, which would help those who are unfamiliar with the code base.

e.g.

    def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
        df = dataset.df

        write_path = os.path.join(self.cache_dir, "_buckets.parquet")
        t0 = time.time()
        with performance_report_if_with_ts_suffix(self.profile_dir, f"lsh-profile"):
            self.lsh(write_path=write_path, df=df)
        self._logger.info(
            f"Time taken for LSH = {time.time() - t0}s and output written at {write_path}"
        )

        if not os.path.exists(write_path):
            self._logger.warning("No LSH duplicates found.")
            return DocumentDataset(dask_cudf.from_cudf(cudf.DataFrame(), npartitions=1))

        buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
        return DocumentDataset(buckets_df)
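
The existence check in the proposal can also be looked at in isolation. Below is a minimal sketch of that guard, not the library's implementation: the helper name read_if_written and its reader and make_empty parameters are hypothetical, and plain Python stand-ins replace dask_cudf.read_parquet and the empty cudf frame so that it runs without a GPU. Whether downstream steps accept an empty, column-less result is not verified here.

```python
import logging
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")

logger = logging.getLogger("lsh_guard_sketch")


def read_if_written(
    write_path: str,
    reader: Callable[[str], T],
    make_empty: Callable[[], T],
) -> T:
    """Read the LSH output if it exists; otherwise warn and return an empty result.

    `reader` and `make_empty` are hypothetical hooks. In the real module they
    would correspond to reading the buckets parquet and building an empty frame.
    """
    if not os.path.exists(write_path):
        logger.warning("No LSH duplicates found; %s was not written.", write_path)
        return make_empty()
    return reader(write_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        write_path = os.path.join(cache_dir, "_buckets.parquet")
        # Nothing was written to write_path, so the guard falls back to make_empty.
        buckets = read_if_written(write_path, reader=lambda p: ["bucket"], make_empty=list)
        print(buckets)
```

Passing the reader and the empty-result factory as arguments keeps the guard independent of the GPU libraries, so the "nothing was written" path can be unit tested on any machine.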

ayushdg · 2024-11-19T06:37:40Z

Thanks for raising this, @davzoku. I'm working on this as part of the refactor in #326. Feel free to share any opinions you might have on how the behavior should be handled in that PR.

davzoku · 2024-11-19T07:15:09Z

I see, @ayushdg! I will take a look.

Mentioning issue #67, as the current issue might be a duplicate of that existing issue.
