Horovod demo #899

Merged
merged 8 commits into from
Feb 23, 2024


porteratzo
Collaborator

@porteratzo porteratzo commented Nov 29, 2023

This PR integrates a new workspace template called torch_llm_horovod. It uses pt_model as the collaborator task runner and launches a Horovod cluster in the train_batches and validate steps via launch_horovod. That call runs InHorovodrun.py, which creates the LLMTrainer, loads the datasets, and trains or validates the model. The task runner and data loader in the collaborator handle saving the data and model state for the LLMTrainer to load and then return to the task runner.
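As a rough illustration of the hand-off described above (save state, run a separate worker process, read back what it wrote), here is a minimal sketch using only the Python standard library. It is not the code from this PR: `launch_worker`, `WORKER_SOURCE`, the JSON state format, and the fake "training" step are all hypothetical stand-ins, and no Horovod or OpenFL API is used.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical worker script. It only stands in for the role the PR
# description gives to InHorovodrun.py; it is not that file.
WORKER_SOURCE = """
import json, sys
state_path, result_path = sys.argv[1], sys.argv[2]
with open(state_path) as f:
    state = json.load(f)
# Placeholder "training": move every weight by a fixed step.
state["weights"] = [w - 0.5 for w in state["weights"]]
with open(result_path, "w") as f:
    json.dump(state, f)
"""


def launch_worker(state, launcher_prefix=()):
    """Save state to disk, run the worker in a separate process,
    and load the state the worker wrote back.

    launcher_prefix is an assumption of this sketch: on a machine with
    Horovod installed it could be something like
    ("horovodrun", "-np", "2") to start several worker processes.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        worker = tmp_dir / "worker.py"
        state_file = tmp_dir / "state.json"
        result_file = tmp_dir / "result.json"

        worker.write_text(WORKER_SOURCE)
        state_file.write_text(json.dumps(state))

        cmd = [*launcher_prefix, sys.executable, str(worker),
               str(state_file), str(result_file)]
        subprocess.run(cmd, check=True)  # raises if the worker fails

        return json.loads(result_file.read_text())


updated = launch_worker({"weights": [1.0, 2.0]})
print(updated["weights"])
```

Passing state through files keeps the parent and the worker decoupled, so the same worker can be started by a plain interpreter, as here, or by a multi-process launcher.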

@psfoley
Contributor

psfoley commented Feb 7, 2024

@porteratzo could you add a description to this PR?

Contributor

@psfoley psfoley left a comment


This is a great contribution, @porteratzo. A couple small requested changes, and then I think this is ready to merge.

openfl-tutorials/experimental/LLM_Horovod.MD (review thread outdated and resolved)
@psfoley psfoley self-requested a review February 23, 2024 19:29
@psfoley psfoley merged commit 9e9047f into securefederatedai:develop Feb 23, 2024
24 of 26 checks passed
nammbash pushed a commit to nammbash/openfl that referenced this pull request Feb 27, 2024
* initial commit

* changes

* name changes, added readme

* remove files, new dataloader inherit

* move files, flake changes

* flake fix

---------

Co-authored-by: Patrick Foley <[email protected]>
Signed-off-by: nammbash <[email protected]>
nammbash pushed a commit to nammbash/openfl that referenced this pull request Feb 29, 2024
* initial commit

* changes

* name changes, added readme

* remove files, new dataloader inherit

* move files, flake changes

* flake fix

---------

Co-authored-by: Patrick Foley <[email protected]>
Signed-off-by: nammbash <[email protected]>
manuelhsantana pushed a commit that referenced this pull request Jul 10, 2024
* initial commit

* changes

* name changes, added readme

* remove files, new dataloader inherit

* move files, flake changes

* flake fix

---------

Co-authored-by: Patrick Foley <[email protected]>
Signed-off-by: manuelhsantana <[email protected]>