[Bug] I can't see any log with pytorch DDP #1126
Comments
DDP is not officially supported yet. You are basically on your own here. |
If you can provide a minimal repro it will go a long way toward identifying the root cause and potentially suggesting a workaround.
|
I made a simple DDP example (https://gist.github.com/ryul99/01c05fe49478241295f980d5c39578de)
This is the log file in
|
Thanks @ryul99, this is helpful. |
This is likely the same problem as #1005. I think the problem is that no one is configuring the logging in the spawned processes. Since those are new processes, they do not inherit the logging configuration from the parent process where hydra.main() configures the logging. |
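A minimal sketch of that behavior (illustrative, not from the thread; `logging.basicConfig` stands in for the configuration that `hydra.main()` performs in the parent process):

```python
# Sketch: the parent configures logging, but a worker started with the spawn
# start method runs in a fresh interpreter, so it has no handlers configured.
import logging

import torch.multiprocessing as mp


def worker(rank):
    # No handlers exist in this freshly spawned process, so this INFO record
    # is dropped (the last-resort handler only emits WARNING and above).
    logging.getLogger(__name__).info("rank %d: this never shows up", rank)


def main():
    logging.basicConfig(level=logging.INFO)  # stands in for Hydra's job logging config
    logging.getLogger(__name__).info("parent: this line is logged")
    mp.spawn(worker, nprocs=2)


if __name__ == "__main__":
    main()
```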
I tried this code
and output is...
I also tried using Hydra with the root logger configured
and the output is...
|
I tried to adapt that issue's solution, but this code doesn't work. It says:
|
@ryul99 When Hydra is initialized, it configures the Python logging module (with the configuration given by the Hydra defaults or user configs). In this case, the standard logging calls work as expected in the main process. However, the logging module in the training sub-processes (spawned by torch.multiprocessing) is not configured. I work around this by loading the saved Hydra config and re-applying the job logging configuration (currently only on the master rank):

```python
import logging
import logging.config

import torch.distributed as dist
from omegaconf import OmegaConf


def is_master():
    return not dist.is_initialized() or dist.get_rank() == 0


def get_logger(name=None):
    if is_master():
        # TODO: also configure logging for sub-processes (not master)
        hydra_conf = OmegaConf.load('.hydra/hydra.yaml')
        logging.config.dictConfig(OmegaConf.to_container(hydra_conf.hydra.job_logging, resolve=True))
    return logging.getLogger(name)


def train_worker(config):
    logger = get_logger('train')  # this should be a local variable
    # setup data_loader instances
    ...
```

By the way, I'm currently also working on enabling Hydra and DDP support for a PyTorch project template. You can check this branch for my full implementation for DDP. |
@SunQpark Thank you for your answer! I'll take a look. |
Thanks @SunQpark. You can do that by calling this on the main process, and passing the object down to the spawned function:

```python
singleton_state = Singleton.get_state()
```

And then initializing the Singletons from the state in the spawned process's function:
|
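A sketch of the full pattern described above (the worker name and its arguments are illustrative, not taken from the thread):

```python
# Pass Hydra's singleton state into spawned workers and restore it there.
import torch.multiprocessing as mp

from hydra.core.hydra_config import HydraConfig
from hydra.core.singleton import Singleton


def train_worker(rank, singleton_state):
    # Restore Hydra's singletons (including HydraConfig) in this new process.
    Singleton.set_state(singleton_state)
    hydra_cfg = HydraConfig.get()  # now accessible in the child process
    ...


def spawn_workers(num_gpus):
    singleton_state = Singleton.get_state()
    mp.spawn(train_worker, args=(singleton_state,), nprocs=num_gpus)
```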
I solved this issue in this way: (ryul99/pytorch-project-template@3f09593)
As I understand it, @SunQpark's solution is to put the main process's Hydra logging setup into a custom config and load that config manually in the subprocesses. I finally understand what omry meant 😅. I can access Hydra's config by using the singleton! |
Thank you for your suggestion, @omry. |
Could you share what errors you get when trying to pickle? We actually pickle SingletonState in one of our plugins. |
It should be picklable. Can you provide a minimal repro of what you are seeing? (Also provide your Python version and pip freeze output.) |
1. Minimal Reproducing Example

I used the following code in 3 different environments:

```python
import sys

import hydra
import pickle as pkl
from hydra.core.singleton import Singleton


@hydra.main(config_path='conf/', config_name='train')
def pickle_state(_):
    state = Singleton.get_state()
    pkl.dumps(state)


if __name__ == '__main__':
    print(sys.version)
    print(hydra.__version__)
    # pylint: disable=no-value-for-parameter
    pickle_state()
```

2. Results

Output from env 1.
Output from env 2.
Output from env 3.

3. pip freeze results

env1
env2
env3
|
Thx for the repro! I was able to reproduce the stack trace locally. Could you use cloudpickle instead and see if that works?
|
@jieru-hu cloudpickle works fine on all environments!
|
Is there something I can try to make PyTorch use cloudpickle? |
@SunQpark, I suggest that you do not use Python 3.6; it will be discontinued in about a year, and I can't think of a reason to prefer it over a newer version. |
Yes, PyTorch does not depend on Cloudpickle so I don't think there is a clean way to get it to use it. |
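One possible workaround, not proposed in the thread and sketched here under the assumption that cloudpickle is installed: since torch.multiprocessing pickles the spawn arguments with the standard pickler, you can run cloudpickle yourself and only hand torch a bytes object, which pickles fine anywhere:

```python
# Hypothetical workaround: serialize Hydra's singleton state with cloudpickle
# by hand, so torch's standard pickler only ever sees plain bytes.
import cloudpickle
import torch.multiprocessing as mp

from hydra.core.singleton import Singleton


def worker(rank, state_bytes):
    Singleton.set_state(cloudpickle.loads(state_bytes))
    ...


def spawn_workers(num_gpus):
    state_bytes = cloudpickle.dumps(Singleton.get_state())
    mp.spawn(worker, args=(state_bytes,), nprocs=num_gpus)
```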
The way I see it, it's fundamentally due to the nature of |
Finally, I solved this issue in this way.
Add Hydra's job logging config to the user's config in trainer.py:
In the unit test:
Config file: |
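A rough sketch of the approach just described, assuming the job logging config is copied into the user config under a hypothetical job_logging_cfg key and re-applied inside each worker (the author's actual snippets may differ):

```python
# Sketch: stash Hydra's job logging config in the user config, then re-apply
# it with dictConfig inside each spawned worker. Field names are illustrative.
import logging
import logging.config

from hydra.core.hydra_config import HydraConfig
from omegaconf import OmegaConf, open_dict


def attach_job_logging(cfg):
    # Resolve and copy Hydra's job logging config so it travels with cfg.
    job_logging = OmegaConf.to_container(HydraConfig.get().job_logging, resolve=True)
    with open_dict(cfg):
        cfg.job_logging_cfg = job_logging  # hypothetical field name
    return cfg


def train_worker(rank, cfg):
    # Re-apply the logging config inside the spawned process.
    logging.config.dictConfig(OmegaConf.to_container(cfg.job_logging_cfg, resolve=True))
    logger = logging.getLogger('train')
    ...
```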
@briankosw, can you try to hack a prototype PR attempting this? |
I'll create a separate issue to address this. |
Just want to say that the official commit to solve this (facebookresearch/fairseq@9a1c497) does not appear to work when replicating the wav2vec2 Hydra scripts, and throws the following (at least for me):
There seem to be issues with resolving |
This is an issue with 1.0.5 that will be addressed in 1.0.6. |
Thank you! Confirmed working correctly after downgrading to 1.0.4 |
This much I already knew. Can you also test 1.0.6 from the repo? |
Yes, but it will need to be tomorrow, as I just started a long-running wav2vec fit using 1.0.4. As soon as it finishes, I'll test 1.0.6 from the repo and report back. |
I was able to test the 1.0_branch and I can confirm that the bug is fixed in that branch. However, there is no 1.0.6 tag in that branch; it only goes up to 1.0.5, and you kept telling me to test 1.0.6, but it doesn't exist. I used the following command to check out:
I then built and installed the wheel from there; it reported version 1.0.5. I ran my Hydra train code and did NOT get the error above. So it does indeed appear to be fixed in that branch. |
Thanks for testing. |
I am going under the assumption that this is a dup of #1005. |
After looking at #1005, I am not sure anymore. |
🐛 Bug
Description
When I use PyTorch's Distributed Data Parallel, I can't see any logs on stdout, and the log file is empty except for the wandb log, as shown below.
Checklist
To reproduce
You can reproduce this by running this repo (https://github.com/ryul99/pytorch-project-template/tree/a80f0284c5b22fba2d4892bb906a9bc2b6075838) with:
python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=1
(DDP with one GPU). I'm sorry that this is not very minimal code 😅
Stack trace/error message
Expected Behavior
If you run with
python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=0
(not using DDP), you can see many logs like this.

System information
Additional context
Add any other context about the problem here.