
[Train] Shard all Input Datasets in Ray Train by default #37668

Closed
woshiyyya opened this issue Jul 22, 2023 · 1 comment · Fixed by #38694
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), train (Ray Train Related Issue)

Comments

woshiyyya (Member) commented Jul 22, 2023

Description

Currently, Ray Train shards only the "train" dataset by default, leaving all other datasets unsharded. Users can configure the DataConfig to shard the other datasets as well.
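For reference, a minimal sketch of that workaround (the import paths, the dataset_config keyword, and the dataset names are assumptions based on Ray ~2.6-era APIs and may differ across versions):

```python
import ray
from ray.train import DataConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    ...  # per-worker training and validation logic


train_ds = ray.data.read_parquet("s3://bucket/train")  # illustrative paths
eval_ds = ray.data.read_parquet("s3://bucket/eval")

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds, "eval": eval_ds},
    scaling_config=ScalingConfig(num_workers=4),
    # Without this line, only the "train" dataset is sharded across the
    # workers; every worker receives the full "eval" dataset.
    dataset_config=DataConfig(datasets_to_split=["eval"]),
)
result = trainer.fit()
```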

This is not satisfactory because:

  • Internal sharding behavior is inconsistent across the input datasets.
  • All N workers evaluate on the full validation dataset, which wastes computation and makes validation take N times as long.

We should consider sharding all input datasets by default.

Use case

No response

woshiyyya added the enhancement, train, and data labels on Jul 22, 2023
woshiyyya changed the title from "[Train] Shard all Ray Datasets in Ray Train by default" to "[Train] Shard all Input Datasets in Ray Train by default" on Jul 22, 2023

YiranJing commented Jul 27, 2023

Common use case at Canva (PyTorch Lightning model)

  • Models have both train and validation datasets, and the trainer runs the validation step on the entire validation dataset after the training step in each epoch.
  • Some models also have a test dataset, which is evaluated after trainer.fit().

Problem

Currently, the large validation dataset is not sharded by default, so each epoch takes longer because every worker runs validation on the full dataset simultaneously. This slows down training and wastes resources.

To address this issue, we need to add data_config=DataConfig(datasets_to_split=["eval"]). However, this is confusing because the train dataset is sharded by default.

Desired Behavior

  • The validation dataset should be sharded by default.
  • Train and validation datasets should require the same sharding configuration from the user (fewer bugs and easier to understand).
  • The validation metric should remain consistent whether training runs on 1 or N workers (see the sketch after this list).
  • No strong opinion on the test dataset, as long as it still works.
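A sketch of how a worker-side validation loop could then report a shard-aware metric so results stay comparable across worker counts. get_dataset_shard and report are taken from the ray.train API of recent Ray versions (older versions expose them on ray.air.session), and compute_loss is a hypothetical helper:

```python
import ray.train


def validate(config):
    # With the eval dataset sharded, each worker iterates only its own slice.
    eval_shard = ray.train.get_dataset_shard("eval")

    total_loss, total_rows = 0.0, 0
    for batch in eval_shard.iter_batches(batch_size=1024):
        n = len(next(iter(batch.values())))      # number of rows in this batch
        total_loss += compute_loss(batch) * n    # compute_loss: hypothetical helper
        total_rows += n

    # Report a per-row average so the aggregated metric stays comparable
    # whether validation ran on 1 worker or N workers.
    ray.train.report({"val_loss": total_loss / total_rows})
```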
