In the current implementation of the __call__ method in nemo_curator/modules/fuzzy_dedup.py, it is assumed that at least one LSH duplicate will be found and that the results will be saved as a Parquet file. However, if the dataset is clean, or too small to contain any fuzzy duplicates, the code throws an error when it tries to read the non-existent Parquet file.
def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
    df = dataset.df

    write_path = os.path.join(self.cache_dir, "_buckets.parquet")
    t0 = time.time()
    with performance_report_if_with_ts_suffix(self.profile_dir, "lsh-profile"):
        self.lsh(write_path=write_path, df=df)
    self._logger.info(
        f"Time taken for LSH = {time.time() - t0}s and output written at {write_path}"
    )

    buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
    return DocumentDataset(buckets_df)
A simple enhancement would be to emit a warning or handle the situation gracefully for users who are unfamiliar with the code base, e.g.:
def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
    df = dataset.df

    write_path = os.path.join(self.cache_dir, "_buckets.parquet")
    t0 = time.time()
    with performance_report_if_with_ts_suffix(self.profile_dir, "lsh-profile"):
        self.lsh(write_path=write_path, df=df)
    self._logger.info(
        f"Time taken for LSH = {time.time() - t0}s and output written at {write_path}"
    )

    # If no buckets were written, no duplicates were found; warn and
    # return an empty DocumentDataset instead of failing on read_parquet.
    if not os.path.exists(write_path):
        self._logger.warning("No LSH duplicates found.")
        return DocumentDataset(dask_cudf.from_cudf(cudf.DataFrame(), npartitions=1))

    buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
    return DocumentDataset(buckets_df)
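If a fix along these lines is adopted, callers can also detect the empty case cheaply before running downstream stages. Below is a minimal caller-side sketch, assuming the empty-DataFrame return shown above; the lsh and dataset names are placeholders for illustration, not part of the NeMo Curator API:

# Caller-side guard (illustrative). The empty DocumentDataset returned by
# the proposed fix has no columns, so checking the column count is cheap
# and does not trigger a Dask computation.
buckets = lsh(dataset)
if len(buckets.df.columns) == 0:
    print("No fuzzy duplicates found; skipping duplicate removal.")
else:
    # Proceed with downstream dedup stages using the bucket assignments.
    ...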
Thanks for raising this, @davzoku. I'm working on it as part of the refactor in #326. Feel free to share any opinions you might have on how the behavior should be handled in that PR.