Skip to content

Commit

Permalink
[Data] Mark num_rows_per_file as experimental (ray-project#48208)
Browse files Browse the repository at this point in the history
If you don't have the access to it, we will shortly find a reviewer and
num_rows_per_file has known issues. In specific:

- If the number of rows per block is larger than the specified value, then Ray Data writes the number of rows per block to each file.
- If specified value isn't divisible by the number of rows per block, then Ray Data writes fewer than the specified value (i.e., the remainder) to some files.
- Writes aren't streamed (in the sense that all blocks for that write are stored in heap memory at the same time), so this feature is prone to OOMing

Signed-off-by: Balaji Veeramani <[email protected]>
  • Loading branch information
bveeramani authored and JP-sDEV committed Nov 14, 2024
1 parent cfcbff1 commit e85f3d1
Showing 1 changed file with 36 additions and 24 deletions.
60 changes: 36 additions & 24 deletions python/ray/data/dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -2877,10 +2877,12 @@ def write_parquet(
instead of ``arrow_parquet_args`` if any of your write arguments
can't pickled, or if you'd like to lazily resolve the write
arguments for each dataset block.
num_rows_per_file: The target number of rows to write to each file. If
``None``, Ray Data writes a system-chosen number of rows to each file.
The specified value is a hint, not a strict limit. Ray Data might write
more or fewer rows to each file.
num_rows_per_file: [Experimental] The target number of rows to write to each
file. If ``None``, Ray Data writes a system-chosen number of rows to
each file. The specified value is a hint, not a strict limit. Ray Data
might write more or fewer rows to each file. In specific, if the number
of rows per block is larger than the specified value, Ray Data writes
the number of rows per block to each file.
ray_remote_args: Kwargs passed to :meth:`~ray.remote` in the write tasks.
concurrency: The maximum number of Ray tasks to run concurrently. Set this
to control number of tasks to run concurrently. This doesn't change the
Expand Down Expand Up @@ -2988,10 +2990,12 @@ def write_json(
instead of ``pandas_json_args`` if any of your write arguments
can't be pickled, or if you'd like to lazily resolve the write
arguments for each dataset block.
num_rows_per_file: The target number of rows to write to each file. If
``None``, Ray Data writes a system-chosen number of rows to each file.
The specified value is a hint, not a strict limit. Ray Data might write
more or fewer rows to each file.
num_rows_per_file: [Experimental] The target number of rows to write to each
file. If ``None``, Ray Data writes a system-chosen number of rows to
each file. The specified value is a hint, not a strict limit. Ray Data
might write more or fewer rows to each file. In specific, if the number
of rows per block is larger than the specified value, Ray Data writes
the number of rows per block to each file.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the write tasks.
concurrency: The maximum number of Ray tasks to run concurrently. Set this
to control number of tasks to run concurrently. This doesn't change the
Expand Down Expand Up @@ -3172,10 +3176,12 @@ def write_csv(
Use this argument instead of ``arrow_csv_args`` if any of your write
arguments cannot be pickled, or if you'd like to lazily resolve the
write arguments for each dataset block.
num_rows_per_file: The target number of rows to write to each file. If
``None``, Ray Data writes a system-chosen number of rows to each file.
The specified value is a hint, not a strict limit. Ray Data might write
more or fewer rows to each file.
num_rows_per_file: [Experimental] The target number of rows to write to each
file. If ``None``, Ray Data writes a system-chosen number of rows to
each file. The specified value is a hint, not a strict limit. Ray Data
might write more or fewer rows to each file. In specific, if the number
of rows per block is larger than the specified value, Ray Data writes
the number of rows per block to each file.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the write tasks.
concurrency: The maximum number of Ray tasks to run concurrently. Set this
to control number of tasks to run concurrently. This doesn't change the
Expand Down Expand Up @@ -3275,10 +3281,12 @@ def write_tfrecords(
filename_provider: A :class:`~ray.data.datasource.FilenameProvider`
implementation. Use this parameter to customize what your filenames
look like.
num_rows_per_file: The target number of rows to write to each file. If
``None``, Ray Data writes a system-chosen number of rows to each file.
The specified value is a hint, not a strict limit. Ray Data might write
more or fewer rows to each file.
num_rows_per_file: [Experimental] The target number of rows to write to each
file. If ``None``, Ray Data writes a system-chosen number of rows to
each file. The specified value is a hint, not a strict limit. Ray Data
might write more or fewer rows to each file. In specific, if the number
of rows per block is larger than the specified value, Ray Data writes
the number of rows per block to each file.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the write tasks.
concurrency: The maximum number of Ray tasks to run concurrently. Set this
to control number of tasks to run concurrently. This doesn't change the
Expand Down Expand Up @@ -3360,10 +3368,12 @@ def write_webdataset(
filename_provider: A :class:`~ray.data.datasource.FilenameProvider`
implementation. Use this parameter to customize what your filenames
look like.
num_rows_per_file: The target number of rows to write to each file. If
``None``, Ray Data writes a system-chosen number of rows to each file.
The specified value is a hint, not a strict limit. Ray Data might write
more or fewer rows to each file.
num_rows_per_file: [Experimental] The target number of rows to write to each
file. If ``None``, Ray Data writes a system-chosen number of rows to
each file. The specified value is a hint, not a strict limit. Ray Data
might write more or fewer rows to each file. In specific, if the number
of rows per block is larger than the specified value, Ray Data writes
the number of rows per block to each file.
ray_remote_args: Kwargs passed to ``ray.remote`` in the write tasks.
concurrency: The maximum number of Ray tasks to run concurrently. Set this
to control number of tasks to run concurrently. This doesn't change the
Expand Down Expand Up @@ -3448,10 +3458,12 @@ def write_numpy(
filename_provider: A :class:`~ray.data.datasource.FilenameProvider`
implementation. Use this parameter to customize what your filenames
look like.
num_rows_per_file: The target number of rows to write to each file. If
``None``, Ray Data writes a system-chosen number of rows to each file.
The specified value is a hint, not a strict limit. Ray Data might write
more or fewer rows to each file.
num_rows_per_file: [Experimental] The target number of rows to write to each
file. If ``None``, Ray Data writes a system-chosen number of rows to
each file. The specified value is a hint, not a strict limit. Ray Data
might write more or fewer rows to each file. In specific, if the number
of rows per block is larger than the specified value, Ray Data writes
the number of rows per block to each file.
ray_remote_args: kwargs passed to :meth:`~ray.remote` in the write tasks.
concurrency: The maximum number of Ray tasks to run concurrently. Set this
to control number of tasks to run concurrently. This doesn't change the
Expand Down

0 comments on commit e85f3d1

Please sign in to comment.