pcmoritz added the labels bug ("Something that is supposed to be working; but isn't") and triage ("Needs triage (eg: priority, bug/not-bug, and owning component)") on Oct 24, 2024
alexeykudinkin added the label P0 ("Issues that should be fixed in short order") and removed the label triage on Oct 25, 2024
What happened + What you expected to happen
A JSONL file larger than 4 GB cannot currently be processed; reading it throws a pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays error.
Versions / Dependencies
Ray 2.38
Reproduction script
First download the dataset with
wget https://huggingface.co/datasets/cognitivecomputations/dolphin/resolve/main/flan5m-alpaca-uncensored-deduped.jsonl
and then run
This yields
Issue Severity
High: It blocks me from completing my task.
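For context on the error message, here is a minimal sketch of the arithmetic that likely sits behind it. It assumes the failure comes from Arrow's default string layout, which stores positions as signed 32-bit offsets, so the combined byte length of one concatenated array cannot exceed 2**31 - 1. The chunk sizes and the helper name `concatenated_size_fits_int32` are hypothetical and used only for illustration; nothing here is taken from Ray's or PyArrow's code.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def concatenated_size_fits_int32(chunk_byte_sizes):
    """Return True if the chunks' combined byte length fits in a 32-bit offset."""
    # The last offset of a concatenated array equals the total data length,
    # so the sum of all chunk sizes is the value that must fit.
    return sum(chunk_byte_sizes) <= INT32_MAX


# Hypothetical chunk sizes: four chunks of 1 GiB each, roughly a 4 GiB file.
one_gib = 1024**3
chunks = [one_gib] * 4

print(concatenated_size_fits_int32(chunks[:1]))  # a single 1 GiB chunk
print(concatenated_size_fits_int32(chunks))      # all four chunks together
```

Under this assumption, each chunk fits on its own, and the overflow only appears once the chunks are concatenated into a single array, which would match the wording of the error.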