
[Train] Shard all Input Datasets in Ray Train by default #37668

Closed
woshiyyya opened this issue Jul 22, 2023 · 1 comment · Fixed by #38694
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), train (Ray Train Related Issue)

Comments

woshiyyya (Member) commented Jul 22, 2023

Description

Currently, Ray Train shards only the "train" dataset by default, leaving all other datasets unsharded. Users can configure the DataConfig to shard the other datasets as well.
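For reference, a minimal sketch of that workaround (the import paths, the dataset_config keyword, and the dataset names are assumptions based on Ray ~2.6-era APIs and may differ across versions):

```python
import ray
from ray.train import DataConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    ...  # per-worker training and validation logic


train_ds = ray.data.read_parquet("s3://bucket/train")  # illustrative paths
eval_ds = ray.data.read_parquet("s3://bucket/eval")

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds, "eval": eval_ds},
    scaling_config=ScalingConfig(num_workers=4),
    # Without this line, only the "train" dataset is sharded across the
    # workers; every worker receives the full "eval" dataset.
    dataset_config=DataConfig(datasets_to_split=["eval"]),
)
result = trainer.fit()
```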

This is not satisfactory because:

  • Internal sharding behavior is inconsistent across the input datasets.
  • All N workers evaluate on the full validation dataset, which wastes computation and makes validation take N times as long.

We should consider sharding all input datasets by default.

Use case

No response

woshiyyya added the enhancement, train, and data labels on Jul 22, 2023
woshiyyya changed the title from "[Train] Shard all Ray Datasets in Ray Train by default" to "[Train] Shard all Input Datasets in Ray Train by default" on Jul 22, 2023

YiranJing commented Jul 27, 2023

Common use case at Canva (PyTorch Lightning model)

  • Models have both train and validation datasets, and the trainer runs the validation step on the entire validation dataset after the training step in each epoch.
  • Some models also have a test dataset, which is evaluated after trainer.fit().

Problem

Currently, the large validation dataset is not sharded by default, so each epoch takes longer because every worker runs validation on the full dataset simultaneously. This slows down training and wastes resources.

To address this issue, we need to add data_config=DataConfig(datasets_to_split=["eval"]). However, this is confusing because the train dataset is sharded by default.

Desired Behavior

  • The validation dataset should be sharded by default.
  • Train and validation datasets should require the same sharding configuration from the user (fewer bugs and easier to understand).
  • The validation metric should remain consistent whether training runs on 1 or N workers (see the sketch after this list).
  • No strong opinion on the test dataset, as long as it still works.
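A sketch of how a worker-side validation loop could then report a shard-aware metric so results stay comparable across worker counts. get_dataset_shard and report are taken from the ray.train API of recent Ray versions (older versions expose them on ray.air.session), and compute_loss is a hypothetical helper:

```python
import ray.train


def validate(config):
    # With the eval dataset sharded, each worker iterates only its own slice.
    eval_shard = ray.train.get_dataset_shard("eval")

    total_loss, total_rows = 0.0, 0
    for batch in eval_shard.iter_batches(batch_size=1024):
        n = len(next(iter(batch.values())))      # number of rows in this batch
        total_loss += compute_loss(batch) * n    # compute_loss: hypothetical helper
        total_rows += n

    # Report a per-row average so the aggregated metric stays comparable
    # whether validation ran on 1 worker or N workers.
    ray.train.report({"val_loss": total_loss / total_rows})
```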
