Passing sample_by to load_dataset when loading text data does not work #6758

Closed
ntoxeg opened this issue Mar 26, 2024 · 1 comment · Fixed by #6792

ntoxeg commented Mar 26, 2024

Describe the bug

I have a dataset consisting of a set of text files, each representing one example. The TextConfig class has an undocumented sample_by argument that Text uses to decide whether to split files into lines, split them into paragraphs, or take them whole. Passing sample_by="document" to load_dataset has no effect: files are still split into lines. As a workaround, I edited src/datasets/packaged_modules/text/text.py locally to switch the default, and that works fine.

As a side note, the if-else chain on sample_by will silently load an empty dataset if the argument value contains a typo, which is not ideal.
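To illustrate the side note, here is a minimal sketch of the three sampling modes with an explicit check on the argument. The helper name split_text and the constant VALID_SAMPLE_BY are hypothetical and are not taken from the library's source; the paragraph rule (split on one blank line) is also an assumption.

```python
from typing import List

# Hypothetical set of accepted values, mirroring the modes described above.
VALID_SAMPLE_BY = ("line", "paragraph", "document")


def split_text(text: str, sample_by: str = "line") -> List[str]:
    """Split one file's text into samples (illustrative helper, not library code)."""
    if sample_by not in VALID_SAMPLE_BY:
        # Fail loudly instead of silently producing an empty dataset.
        raise ValueError(
            f"sample_by must be one of {VALID_SAMPLE_BY}, got {sample_by!r}"
        )
    if sample_by == "line":
        return text.splitlines()
    if sample_by == "paragraph":
        # Assumption: paragraphs are separated by one blank line.
        return [p for p in text.split("\n\n") if p]
    # "document": the whole file is a single sample.
    return [text]
```

With a guard like this, a typo such as sample_by="documnet" raises a ValueError rather than falling through every branch.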

Steps to reproduce the bug

  1. Prepare data as a bunch of files in a directory.
  2. Load that data via load_dataset("text", data_files=<data_dir>/<files_glob>, …, sample_by="document").
  3. Inspect the resulting dataset: every item has the form {"text": <a line from a file>}.
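The steps above could be scripted roughly as follows. This is only a sketch: the file names, file contents, and the helper names make_corpus and reproduce are made up for illustration, and reproduce needs the datasets package installed.

```python
import tempfile
from pathlib import Path
from typing import List


def make_corpus(root: Path) -> List[Path]:
    """Write two small multi-line files; each file is meant to be one example."""
    docs = {
        "a.txt": "first line of A\nsecond line of A\n",
        "b.txt": "first line of B\nsecond line of B\n",
    }
    paths = []
    for name, body in sorted(docs.items()):
        path = root / name
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths


def reproduce() -> int:
    """Load the corpus with sample_by="document" and return the number of rows."""
    # Imported lazily so make_corpus can be used without the datasets package.
    from datasets import load_dataset

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_corpus(root)
        ds = load_dataset(
            "text",
            data_files=str(root / "*.txt"),
            sample_by="document",
            split="train",
        )
        # If files are split into lines, this is 4; one row per file gives 2.
        return len(ds)
```

Calling reproduce() and comparing the result against the number of files is a quick way to tell which behavior is in effect.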

Expected behavior

load_dataset("text", data_files=<data_dir>/<files_glob>, …, sample_by="document") should result in a dataset with items of the form {"text": <one document>}.
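One way to check the expected behavior is to compare the loaded rows against the raw file contents. The function name rows_are_whole_documents is hypothetical, and the check assumes each document's text is stored unmodified, trailing newline included.

```python
from pathlib import Path
from typing import Iterable, Mapping


def rows_are_whole_documents(
    rows: Iterable[Mapping[str, str]], paths: Iterable[Path]
) -> bool:
    """Return True if the rows' "text" values are exactly the files' full contents."""
    expected = sorted(Path(p).read_text(encoding="utf-8") for p in paths)
    actual = sorted(row["text"] for row in rows)
    return actual == expected
```

Any iterable of {"text": ...} mappings can be passed as rows, so the same check works on a plain list or on a loaded dataset.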

Environment info

  • datasets version: 2.18.0
  • Platform: Linux-5.15.0-1046-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.8
  • huggingface_hub version: 0.21.4
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.1
  • fsspec version: 2024.2.0
mariosasko (Collaborator) commented

Thanks for reporting! We are working on a fix.
