-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Deepspeed Trainer #10100
feat: Deepspeed Trainer #10100
Conversation
harness/determined/pytorch/_util.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i feel like maybe this should just stay where it is, in pytorch_trial. deepspeed a subpackage of pytorch anyway, and lots of this functionality is very specific to the training loop implementation of pytorch trial, which happens to be shared by deepspeed trial. also, user imports as pytorch.Batch
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, user imports wouldn't change regardless if this is in pytorch_trial.py
versus _util.py
What about renaming to _training_utils.py
? I'd rather keep things that are shareable out of _pytorch_trial.py
rather than in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, user imports wouldn't change regardless if this is in pytorch_trial.py versus _util.py
yeah, that's what i'm saying, user imports are the same anyway so it's a purely internal concern that we move it. i was hoping that deepspeed trial could subclass pytorchtrial at some point, so i guess that's how i was thinking about it.
i'm fine either way on this. i do kinda like _training_utils.py
better... what about _trainer_utils.py
? feel like that's slightly better than _util
.
info = det.get_cluster_info() | ||
|
||
if torch.cuda.is_available(): | ||
deepspeed.init_distributed() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
existing code has:
# We use an environment variable to allow users to enable custom initialization routine for
# distributed training since the pre_execute_hook runs before trial initialization.
manual_dist_init = os.environ.get("DET_MANUAL_INIT_DISTRIBUTED")
if not manual_dist_init:
# DeepSpeed's init_distributed handles situations in which only 1 gpu is used and
# also handles multiple calls to init in one process.
deepspeed.init_distributed(auto_mpi_discovery=False)
which we probably need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, but i think the whole point of that manual flag is to not do this if DET_MANUAL_INIT_DISTRIBUTED is false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole block would be easier read as:
if distributed_backend.use_deepspeed():
manual_dist_init = os.environ.get("DET_MANUAL_INIT_DISTRIBUTED")
if not manual_dist_init:
deepspeed.init_distributed(auto_mpi_discovery=False)
return core.DistributedContext.from_deepspeed()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this would break my example, which can be run on gpu without explicitly calling the launcher, via python trainer.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should only get called if local training != true tho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
|
||
if self.is_chief: | ||
# We assume these stateful objects should be the same across slots and only have | ||
# the chief save them. | ||
util.write_user_code(path, self.env.on_cluster) | ||
|
||
if self.wlsq is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# We assume these stateful objects should be the same across slots and only have # the chief save them.
not a big deal, but previously we were saving the training state only on the chief (now TrialState)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeepSpeedTrial previously called _save()
on every rank, so each gpu had it's own state. The only thing done once was/is saving user code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the saving of the training state is also gated by the chief worker?
if self.is_chief:
# We assume these stateful objects should be the same across slots and only have
# the chief save them.
util.write_user_code(path, self.env.on_cluster)
if self.wlsq is not None:
with path.joinpath("workload_sequencer.pkl").open("wb") as f:
pickle.dump(self.wlsq.get_state(), f)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added state to the top self.is_chief
portion of this method
6ad58a5
to
7212d0e
Compare
6a3de3d
to
d623e28
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## searcher-context-removal #10100 +/- ##
============================================================
+ Coverage 53.72% 54.10% +0.38%
============================================================
Files 1256 1258 +2
Lines 156491 156839 +348
Branches 3616 3616
============================================================
+ Hits 84074 84864 +790
+ Misses 72284 71842 -442
Partials 133 133
Flags with carried forward coverage won't be shown. Click here to find out more.
|
7212d0e
to
dfd7d50
Compare
dfd7d50
to
db8d4b0
Compare
055a3a8
to
77351cf
Compare
Docsite preview being generated for this PR. |
77351cf
to
d7943bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feel like the num_micro_batches_per_slot
and train_micro_batch_size_per_gpu
should be get_*
following convention on the rest of the methods here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced @property
with get_
the gpt_neox example still uses the legacy harness trial. we should either get rid of the example if it's not useful, or port it to trainer |
2970bb2
to
544c404
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Infra part is fine
34fda27
to
cae787d
Compare
cae787d
to
23277e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
refactors max_length dependency and searcher context out of DeepSpeedTrial
46c2de8
into
searcher-context-removal
Co-authored-by: Anda Zhou <[email protected]>
Co-authored-by: Anda Zhou <[email protected]>
Ticket
https://hpe-aiatscale.atlassian.net/browse/MD-501
Description
Add Deepspeed Trainer API
Refactor some pytorch training classes
Reinstate DCGANTrial as an example of DeepSpeed Trainer
Test Plan
README
included withtests/deepspeed/dcgan
. After it's done, continue the run for 100 more batches.Checklist
docs/release-notes/
See Release Note for details.