[Data] Cannot load large JSONL file #48236

Open
pcmoritz opened this issue Oct 24, 2024 · 1 comment · May be fixed by #48266
Assignees: alexeykudinkin
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P0 (Issues that should be fixed in short order)

Comments

@pcmoritz
Contributor

What happened + What you expected to happen

A JSONL file larger than 4 GB cannot currently be processed; reading it throws a pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays error.
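This appears to come from Arrow's default string type, which stores value offsets as int32, so concatenating string arrays whose combined character data exceeds ~2 GiB overflows those offsets. A minimal PyArrow-only sketch of the underlying failure, independent of Ray (sizes are illustrative, and it allocates a few GB of memory):

import pyarrow as pa

# Two string arrays whose combined character data exceeds INT32_MAX (~2 GiB).
big = pa.array(["x" * 1024] * 1_100_000)  # roughly 1.1 GiB of character data
try:
    pa.concat_arrays([big, big])
except pa.ArrowInvalid as e:
    print(e)  # offset overflow while concatenating arrays

# Casting to large_string (which uses int64 offsets) would avoid the overflow:
pa.concat_arrays([big.cast(pa.large_string()), big.cast(pa.large_string())])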

Versions / Dependencies

Ray 2.38

Reproduction script

First download the dataset with

wget https://huggingface.co/datasets/cognitivecomputations/dolphin/resolve/main/flan5m-alpaca-uncensored-deduped.jsonl

and then run:

import ray.data

# Reading the > 4 GB JSONL file and counting rows triggers the error below.
ds = ray.data.read_json("/home/ray/flan5m-alpaca-uncensored-deduped.jsonl")
ds.count()

This yields:

2024-10-24 00:42:15,962 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-10-23_22-45-30_455655_2570/logs/ray-data
2024-10-24 00:42:15,963 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles]
Running Dataset. Active & requested resources: 1/32 CPU, 256.0MB/17.9GB object store: 0.00 row [00:56, ? row/s]
2024-10-24 00:43:13,305   ERROR streaming_executor_state.py:485 -- An exception was raised from a task of operator "ReadFiles". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
⚠️  Dataset execution failed: 0.00 row [00:57, ? row/s]
- ExpandPaths: Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 52.0B object store: 1.00 row [00:57, 57.3s/ row]
- ReadFiles: Tasks: 1; Queued blocks: 0; Resources: 1.0 CPU, 256.0MB object store: 0.00 row [00:57, ? row/s]
2024-10-24 00:43:13,319 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
2024-10-24 00:43:13,319 ERROR exceptions.py:81 -- Full stack trace:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/exceptions.py", line 49, in handle_trace
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/plan.py", line 433, in execute_to_iterator
    bundle_iter = itertools.chain([next(gen)], gen)
                                   ^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
           ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/legacy_compat.py", line 76, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor.py", line 153, in get_next
    item = self._outer._output_node.get_output_blocking(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 312, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor.py", line 232, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor.py", line 287, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 486, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 453, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 105, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 101, in on_data_ready
    ray.get(block_ref)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 2745, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 901, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ArrowInvalid): ray::ReadFiles() (pid=42356, ip=10.0.25.228)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 395, in __call__
    yield output_buffer.next()
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/output_buffer.py", line 94, in next
    block_remainder = block.slice(
                      ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 284, in slice
    view = _copy_table(view)
           ^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 727, in _copy_table
    return transform_pyarrow.combine_chunks(table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 346, in combine_chunks
    arr = col.combine_chunks()
          ^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 745, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3723, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
---------------------------------------------------------------------------
SystemException                           Traceback (most recent call last)
SystemException: 

The above exception was the direct cause of the following exception:

RayTaskError(ArrowInvalid)                Traceback (most recent call last)
Cell In[3], line 1
----> 1 ds.count()

File ~/anaconda3/lib/python3.12/site-packages/ray/data/dataset.py:2651, in Dataset.count(self)
   2648 # Directly loop over the iterator of `RefBundle`s instead of
   2649 # retrieving a full list of `BlockRef`s.
   2650 total_rows = 0
-> 2651 for ref_bundle in self.iter_internal_ref_bundles():
   2652     num_rows = ref_bundle.num_rows()
   2653     # Executing the dataset always returns blocks with valid `num_rows`.

File ~/anaconda3/lib/python3.12/site-packages/ray/data/dataset.py:4845, in Dataset.iter_internal_ref_bundles(self)
   4827 @ConsumptionAPI(pattern="Examples:")
   4828 @DeveloperAPI
   4829 def iter_internal_ref_bundles(self) -> Iterator[RefBundle]:
   4830     """Get an iterator over ``RefBundles``
   4831     belonging to this Dataset. Calling this function doesn't keep
   4832     the data materialized in-memory.
   (...)
   4842         An iterator over this Dataset's ``RefBundles``.
   4843     """
-> 4845     iter_ref_bundles, _, _ = self._plan.execute_to_iterator()
   4846     self._synchronize_progress_bar()
   4847     return iter_ref_bundles

File ~/anaconda3/lib/python3.12/site-packages/ray/data/exceptions.py:89, in omit_traceback_stdout.<locals>.handle_trace(*args, **kwargs)
     87     raise e.with_traceback(None)
     88 else:
---> 89     raise e.with_traceback(None) from SystemException()

RayTaskError(ArrowInvalid): ray::ReadFiles() (pid=42356, ip=10.0.25.228)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 395, in __call__
    yield output_buffer.next()
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/output_buffer.py", line 94, in next
    block_remainder = block.slice(
                      ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 284, in slice
    view = _copy_table(view)
           ^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_block.py", line 727, in _copy_table
    return transform_pyarrow.combine_chunks(table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 346, in combine_chunks
    arr = col.combine_chunks()
          ^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 745, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3723, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Issue Severity

High: It blocks me from completing my task.

@pcmoritz pcmoritz added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 24, 2024
@alexeykudinkin alexeykudinkin self-assigned this Oct 25, 2024
@alexeykudinkin alexeykudinkin added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 25, 2024
@alexeykudinkin
Contributor

alexeykudinkin commented Oct 25, 2024

This is affecting not only JSONL but any data source in the following case:

  1. It's a single file
  2. There's a column that, combined across all rows, takes up > 2 GB

Hence P0
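For reference, the call that raises in the stack trace above is PyArrow's ChunkedArray.combine_chunks(), which rebuilds each column as a single contiguous array. A minimal PyArrow-only sketch of that condition (sizes are illustrative; it allocates a few GB of memory):

import pyarrow as pa

# Hypothetical string column whose chunks together hold > 2 GiB of
# character data, matching condition 2 above.
col = pa.chunked_array([["x" * 1024] * 1_100_000] * 2)
try:
    col.combine_chunks()  # same call as in the stack trace
except pa.ArrowInvalid as e:
    print(e)  # offset overflow while concatenating arrays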

@alexeykudinkin alexeykudinkin added the data Ray Data-related issues label Oct 25, 2024