[Train] Unify Torch based Trainers on the TorchTrainer API #37
Conversation
How about we also cover the proposed changes to `TorchCheckpoint` (and the `Checkpoint`<>`Trainer` relationship in general) in this REP?
In addition, as you brought up, we could also add examples of providing `datasets=...` in this REP for completeness.
Much in favor of this change.
2. The `LightningTrainer` and `TransformersTrainer` APIs are opinionated and may not allow the user to fully express their desired training logic (e.g. validation and testing).
3. The user cannot see the internal training loop, which makes it hard to implement and debug their code.

This proposal explores the idea of centralizing on a single `TorchTrainer` as the single way of running training code for PyTorch-based frameworks in a distributed fashion.
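For reference, a minimal sketch of that single entry point, based on the public Ray Train API (the exact signatures in this REP may differ):

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func():
        # Framework-specific code (vanilla PyTorch, Lightning, or
        # Transformers) runs here on every distributed worker.
        ...

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    result = trainer.fit()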
I also want to call out that we already do this for `TensorflowTrainer` (native TF vs. Keras), though the surface area is smaller with only two APIs.
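For comparison, a rough sketch of that existing `TensorflowTrainer` pattern, assuming the public Ray Train API (details vary by Ray version):

    from ray.train import ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer

    def train_func():
        # Both a native TF training loop and a Keras model.fit()
        # call are expressed through this one entry point.
        ...

    trainer = TensorflowTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2),
    )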
    callbacks=[checkpoint_callback],
    devices=ray.train.lightning.get_devices(),
    strategy=ray.train.lightning.RayDDPStrategy(),
    plugins=[ray.train.lightning.RayEnvironment()],
Suggested change:

    - plugins=[ray.train.lightning.RayEnvironment()],
    + plugins=[ray.train.lightning.RayEnvPlugin()],
Hmm, this is a subclass of the `LightningEnvironment` class, which is a `Plugin`. What about `RayLightningEnvironment`?
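To illustrate, the surrounding snippet with that suggested name would read roughly as follows (a sketch reusing identifiers from this REP's example; `checkpoint_callback` and `get_devices` come from that example, and `RayLightningEnvironment` is the proposed name, not a settled API):

    import pytorch_lightning as pl
    import ray.train.lightning

    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],  # defined earlier in the REP example
        devices=ray.train.lightning.get_devices(),
        strategy=ray.train.lightning.RayDDPStrategy(),
        # The LightningEnvironment subclass is registered via `plugins`,
        # hence the suggested RayLightningEnvironment name.
        plugins=[ray.train.lightning.RayLightningEnvironment()],
    )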
Good to move forward with broader review/vote.
LG
LGTM. Let's make it happen
Vote for this proposal!
That's great! Migrating a normal PyTorch Lightning model to the Ray Trainer will then be much easier 🎉, and it will require less effort from users who are familiar with PyTorch Lightning but new to Ray.
    eval_dataset = ray.train.get_dataset_shard("eval")
    eval_dataset = RayDataIterableDataset(val_dataset)
Suggested change:

    - eval_dataset = ray.train.get_dataset_shard("eval")
    - eval_dataset = RayDataIterableDataset(val_dataset)
    + eval_dataset_shard = ray.train.get_dataset_shard("eval")
    + eval_dataset = RayDataIterableDataset(val_dataset_shard)
Ah, in this example I wanted to intentionally show that only `train` is being sharded (since there is no `DataConfig` specified).
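Spelled out, the behavior described above would look roughly like this inside the training function (a sketch reusing this REP's `RayDataIterableDataset` name):

    import ray.train

    def train_func():
        # "train" is split across workers by default.
        train_shard = ray.train.get_dataset_shard("train")
        train_dataset = RayDataIterableDataset(train_shard)

        # With no DataConfig specified, "eval" is not split:
        # each worker sees the full validation dataset.
        eval_shard = ray.train.get_dataset_shard("eval")
        eval_dataset = RayDataIterableDataset(eval_shard)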
Got it! The function name (`ray.train.get_dataset_shard("eval")`) is a bit confusing, since it makes it seem like the example is also sharding the validation dataset. This should be addressable in ray-project/ray#37668.
    train_dataset = ray.data.read_parquet(...).map_batches(...)
    eval_dataset = ray.data.read_parquet(...).map_batches(...)
    trainer = ray.train.torch.TorchTrainer(train_func, datasets={"train": train_dataset, "eval": eval_dataset})
Currently, we require the use of `DataConfig` for correct validation-data sharding, e.g.:

    trainer = ray.train.torch.TorchTrainer(
        train_func,
        datasets={"train": train_dataset, "eval": eval_dataset},
        data_config=DataConfig(datasets_to_split=["train", "eval"])  # required
    )
@woshiyyya I agree it would be good, and would result in fewer bugs, if users could easily perform dataset sharding without the need for `data_config`.
Hi @YiranJing, thanks for the feedback! I agree that sharding validation datasets by default makes a lot of sense.
We may still need to keep `data_config`, but we can slightly change the default behavior so that most users don't have to provide an extra `DataConfig` to `TorchTrainer`.
If some users do have a special use case and want an unsharded dataset on each worker, they can then provide a `DataConfig`, e.g.:
    trainer = ray.train.torch.TorchTrainer(
        train_func,
        datasets={"train": train_dataset, "eval": eval_dataset, "special": no_split_dataset},
        data_config=DataConfig(datasets_no_split=["special"])
    )
Thanks for clarifying!
@YiranJing could you share more about your use case and desired behavior in ray-project/ray#37668?
Replied here: ray-project/ray#37668 (comment)
Great work and thanks @matthewdeng
This REP proposes to remove the `LightningTrainer`, `TransformersTrainer`, and `AccelerateTrainer` APIs and unify them on the `TorchTrainer` API.