[Data] Cannot load large JSONL file #48236

Open
pcmoritz opened this issue Oct 24, 2024 · 1 comment · May be fixed by #48266
Assignees: alexeykudinkin
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P0 (Issues that should be fixed in short order)

Comments

@pcmoritz
Contributor

What happened + What you expected to happen

A JSONL file larger than 4 GB cannot currently be processed; reading it throws a pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays error.
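This appears to come from Arrow's default string type, which stores value offsets as int32, so concatenating string arrays whose combined character data exceeds ~2 GiB overflows those offsets. A minimal PyArrow-only sketch of the underlying failure, independent of Ray (sizes are illustrative, and it allocates a few GB of memory):

import pyarrow as pa

# Two string arrays whose combined character data exceeds INT32_MAX (~2 GiB).
big = pa.array(["x" * 1024] * 1_100_000)  # roughly 1.1 GiB of character data
try:
    pa.concat_arrays([big, big])
except pa.ArrowInvalid as e:
    print(e)  # offset overflow while concatenating arrays

# Casting to large_string (which uses int64 offsets) would avoid the overflow:
pa.concat_arrays([big.cast(pa.large_string()), big.cast(pa.large_string())])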

Versions / Dependencies

Ray 2.38

Reproduction script

First download the dataset with

wget https://huggingface.co/datasets/cognitivecomputations/dolphin/resolve/main/flan5m-alpaca-uncensored-deduped.jsonl

and then run:

import ray.data

# Reading the > 4 GB JSONL file and counting rows triggers the error below.
ds = ray.data.read_json("/home/ray/flan5m-alpaca-uncensored-deduped.jsonl")
ds.count()

This yields:

2024-10-24 00:42:15,962 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-10-23_22-45-30_455655_2570/logs/ray-data
2024-10-24 00:42:15,963 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles]
Running Dataset. Active & requested resources: 1/32 CPU, 256.0MB/17.9GB object store: 0.00 row [00:56, ? row/s]
2024-10-24 00:43:13,305   ERROR streaming_executor_state.py:485 -- An exception was raised from a task of operator "ReadFiles". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
⚠️  Dataset execution failed: 0.00 row [00:57, ? row/s]
- ExpandPaths: Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 52.0B object store: 1.00 row [00:57, 57.3s/ row]
- ReadFiles: Tasks: 1; Queued blocks: 0; Resources: 1.0 CPU, 256.0MB object store: 0.00 row [00:57, ? row/s]
2024-10-24 00:43:13,319 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2024-10-24 00:43:13,319 ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/exceptions.py", line 49, in handle_trace
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/plan.py", line 433, in execute_to_iterator
    bundle_iter = itertools.chain([next(gen)], gen)
                                   ^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
           ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/legacy_compat.py", line 76, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor.py", line 153, in get_next
    item = self._outer._output_node.get_output_blocking(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 312, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor.py", line 232, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor.py", line 287, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 486, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 453, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 105, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 101, in on_data_ready
    ray.get(block_ref)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 2745, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 901, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ArrowInvalid): ray::ReadFiles() (pid=42356, ip=10.0.25.228)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 395, in __call__
    yield output_buffer.next()
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/output_buffer.py", line 94, in next
    block_remainder = block.slice(
                      ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 284, in slice
    view = _copy_table(view)
           ^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 727, in _copy_table
    return transform_pyarrow.combine_chunks(table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 346, in combine_chunks
    arr = col.combine_chunks()
          ^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 745, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3723, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
---------------------------------------------------------------------------
SystemException                           Traceback (most recent call last)
SystemException: 

The above exception was the direct cause of the following exception:

RayTaskError(ArrowInvalid)                Traceback (most recent call last)
Cell In[3], line 1
----> 1 ds.count()

File ~/anaconda3/lib/python3.12/site-packages/ray/data/dataset.py:2651, in Dataset.count(self)
   2648 # Directly loop over the iterator of `RefBundle`s instead of
   2649 # retrieving a full list of `BlockRef`s.
   2650 total_rows = 0
-> 2651 for ref_bundle in self.iter_internal_ref_bundles():
   2652     num_rows = ref_bundle.num_rows()
   2653     # Executing the dataset always returns blocks with valid `num_rows`.

File ~/anaconda3/lib/python3.12/site-packages/ray/data/dataset.py:4845, in Dataset.iter_internal_ref_bundles(self)
   4827 @ConsumptionAPI(pattern="Examples:")
   4828 @DeveloperAPI
   4829 def iter_internal_ref_bundles(self) -> Iterator[RefBundle]:
   4830     """Get an iterator over ``RefBundles``
   4831     belonging to this Dataset. Calling this function doesn't keep
   4832     the data materialized in-memory.
   (...)
   4842         An iterator over this Dataset's ``RefBundles``.
   4843     """
-> 4845     iter_ref_bundles, _, _ = self._plan.execute_to_iterator()
   4846     self._synchronize_progress_bar()
   4847     return iter_ref_bundles

File ~/anaconda3/lib/python3.12/site-packages/ray/data/exceptions.py:89, in omit_traceback_stdout.<locals>.handle_trace(*args, **kwargs)
     87     raise e.with_traceback(None)
     88 else:
---> 89     raise e.with_traceback(None) from SystemException()

RayTaskError(ArrowInvalid): ray::ReadFiles() (pid=42356, ip=10.0.25.228)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 395, in __call__
    yield output_buffer.next()
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/output_buffer.py", line 94, in next
    block_remainder = block.slice(
                      ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 284, in slice
    view = _copy_table(view)
           ^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 727, in _copy_table
    return transform_pyarrow.combine_chunks(table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 346, in combine_chunks
    arr = col.combine_chunks()
          ^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 745, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3723, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Issue Severity

High: It blocks me from completing my task.

@pcmoritz pcmoritz added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 24, 2024
@alexeykudinkin alexeykudinkin self-assigned this Oct 25, 2024
@alexeykudinkin alexeykudinkin added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 25, 2024
@alexeykudinkin
Contributor

alexeykudinkin commented Oct 25, 2024

This is affecting not only JSONL but any data source in the following case:

  1. It's a single file
  2. There's a column that, combined across all rows, takes up > 2 GB

Hence P0
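For reference, the call that raises in the stack trace above is PyArrow's ChunkedArray.combine_chunks(), which rebuilds each column as a single contiguous array. A minimal PyArrow-only sketch of that condition (sizes are illustrative; it allocates a few GB of memory):

import pyarrow as pa

# Hypothetical string column whose chunks together hold > 2 GiB of
# character data, matching condition 2 above.
col = pa.chunked_array([["x" * 1024] * 1_100_000] * 2)
try:
    col.combine_chunks()  # same call as in the stack trace
except pa.ArrowInvalid as e:
    print(e)  # offset overflow while concatenating arrays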

@alexeykudinkin alexeykudinkin added the data Ray Data-related issues label Oct 25, 2024