-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct CWD for ddp subprocesses when using Hydra #2719
Conversation
@omry @romesc-CMU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the integration tests are not testing a Hydra app with ddp why would they cover this scenario?
@omry you're right, we don't have test coverage for this particular case where the training script is built using hydra. i just meant that this doesn't necessarily decrease the test coverage. @Borda could you advise me on how we should add tests for this? Is it ok to merge this as is since we currently don't have tests for hydra scripts, or do we want to use this PR as an opportunity to add some (which will take more time of course)? |
@@ -467,7 +468,12 @@ def spawn_ddp_children(self, model): | |||
env_copy['LOCAL_RANK'] = f'{local_rank}' | |||
|
|||
# start process | |||
proc = subprocess.Popen(command, env=env_copy) | |||
# if hydra is available and initialized, make sure to set the cwd correctly | |||
cwd: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will break for non hydra users... no?
shouldn't cwd be set to something for non hydra people?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
is the default value for cwd
(https://docs.python.org/3/library/subprocess.html#subprocess.Popen). If None
, it just uses whatever the current value for cwd is.
If cwd is not None, the function changes the working directory to cwd before executing the child. cwd can be a string, bytes or path-like object. In particular, the function looks for executable (or for the first item in args) relative to cwd if the executable path is a relative path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So currently, we're setting cwd
to None
for everyone, hence the problem for Hydra scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we pull the cwd correctly then and also use it for non hydra users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, but what should we set it to for non hydra users? do you think we need to save the original cwd and then use that? I feel like that's going down the slippery slope b/c we can't possibly anticipate how a script will change the cwd... for hydra scripts we can be sure b/c of the way it operates...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a test for this update?
Codecov Report
@@ Coverage Diff @@
## master #2719 +/- ##
======================================
- Coverage 91% 91% -0%
======================================
Files 82 82
Lines 6784 6789 +5
======================================
Hits 6151 6151
- Misses 633 638 +5 |
What does this PR do?
Fixes #2691. This PR sets the cwd of the ddp subprocesses to the original cwd returned by Hydra (when it's enabled). I looked for specific tests, but I think this will be caught by current integration tests? I did verify this manually with my own training script that I used to reproduce the bug.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃