
[Bug] I can't see any log with pytorch DDP #1126

Closed
ryul99 opened this issue Nov 5, 2020 · 34 comments

Labels: bug (Something isn't working) · question (Hydra usage question)

ryul99 commented Nov 5, 2020

🐛 Bug

Description

When I use PyTorch's DistributedDataParallel, I can't see any logs on stdout, and the log file is empty except for a wandb line like the one below.

[2020-11-06 01:32:08,629][wandb.internal.internal][INFO] - Internal process exited

Checklist

  • I checked on the latest version of Hydra
  • I created a minimal repro

To reproduce

You can reproduce it by running this repo (https://github.com/ryul99/pytorch-project-template/tree/a80f0284c5b22fba2d4892bb906a9bc2b6075838) with python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=1 (DDP with one GPU).
I'm sorry that this is not very minimal code 😅

Stack trace/error message

❯ python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=1
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning: 
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
  warnings.warn(category=UserWarning, message=msg)
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/plugins/config_source.py:190: UserWarning: 
Missing @package directive train/default.yaml in file:///home/ryul99/Workspace/pytorch-project-template/config.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
  warnings.warn(message=msg, category=UserWarning)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Processing...
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/torchvision/datasets/mnist.py:480: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Done!

Expected Behavior

If you run python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=0 (not using DDP), you can see many logs like this.

❯ python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=0
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning: 
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
  warnings.warn(category=UserWarning, message=msg)
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/plugins/config_source.py:190: UserWarning: 
Missing @package directive train/default.yaml in file:///home/ryul99/Workspace/pytorch-project-template/config.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
  warnings.warn(message=msg, category=UserWarning)
[2020-11-06 01:48:33,990][trainer.py][INFO] - Config:
train:
  name: First_training
  working_dir: /home/ryul99/Workspace/pytorch-project-template
  data:
    train_dir: dataset/meta/train
    test_dir: dataset/meta/test
    file_format: '*.file_extension'
    use_background_generator: true
    divide_dataset_per_gpu: true
  train:
    random_seed: 3750
    num_epoch: 10000
    num_workers: 4
    batch_size: 64
    optimizer:
      mode: adam
      adam:
        lr: 0.001
        betas:
        - 0.9
        - 0.999
    dist:
      master_addr: localhost
      master_port: '12355'
      mode: nccl
      gpus: 0
      timeout: 30
  test:
    num_workers: 4
    batch_size: 64
  model:
    device: cuda
  log:
    use_tensorboard: true
    use_wandb: false
    wandb_init_conf:
      name: ${train.name}
      entity: null
      project: null
    summary_interval: 1
    chkpt_interval: 10
    chkpt_dir: chkpt
  load:
    wandb_load_path: null
    network_chkpt_path: null
    strict_load: false
    resume_state_path: null

[2020-11-06 01:48:33,991][trainer.py][INFO] - Set up train process
[2020-11-06 01:48:33,991][trainer.py][INFO] - BackgroundGenerator is turned off when Distributed running is on
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Processing...
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/torchvision/datasets/mnist.py:480: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Done!
[2020-11-06 01:48:38,886][trainer.py][INFO] - Making train dataloader...
[2020-11-06 01:48:38,905][trainer.py][INFO] - Making test dataloader...
[2020-11-06 01:48:40,366][trainer.py][INFO] - Starting new training run.
[2020-11-06 01:48:40,467][train_model.py][INFO] - Train Loss 2.3010 at step 1
[2020-11-06 01:48:40,473][train_model.py][INFO] - Train Loss 2.3133 at step 2

System information

  • Hydra Version : 1.0.3
  • Python version : 3.8.5
  • Virtual environment type and version : pyenv 1.2.21
  • Operating system : Ubuntu server 18.04


@ryul99 ryul99 added the bug Something isn't working label Nov 5, 2020
@ryul99 ryul99 changed the title [Bug] I can't see any logs with pytorch DDP [Bug] I can't see any log with pytorch DDP Nov 5, 2020

omry commented Nov 5, 2020

DDP is not officially supported yet, so you are basically on your own here.
Without a minimal repro I am left guessing at what your code looks like. I'm sorry, but I can't really help.


omry commented Nov 5, 2020

If you can provide a minimal repro it will go a long way toward identifying the root cause and potentially suggesting a workaround.
Closing for now.

  1. Feel free to continue the discussion here.
  2. If you can provide the repro please reopen.

@omry omry closed this as completed Nov 5, 2020

ryul99 commented Nov 6, 2020

I made a simple DDP repro (https://gist.github.com/ryul99/01c05fe49478241295f980d5c39578de).
Its output is:

/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning: 
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
  warnings.warn(category=UserWarning, message=msg)
[2020-11-06 21:12:19,427][DDP.py][INFO] - Hi! I'm info from main function
[2020-11-06 21:12:19,427][DDP.py][WARNING] - Hi! I'm warning from main function
[2020-11-06 21:12:19,427][DDP.py][ERROR] - Hi! I'm error from main function
Hi! I'm warning from train_loop
Hi! I'm error from train_loop

This is the log file in the outputs folder:

[2020-11-06 21:12:19,427][DDP.py][INFO] - Hi! I'm info from main function
[2020-11-06 21:12:19,427][DDP.py][WARNING] - Hi! I'm warning from main function
[2020-11-06 21:12:19,427][DDP.py][ERROR] - Hi! I'm error from main function


omry commented Nov 6, 2020

Thanks @ryul99, this is helpful.
We will take a look.

@omry omry reopened this Nov 6, 2020
@omry omry added this to the 1.1.0 milestone Nov 6, 2020
@omry omry added the question Hydra usage question label Nov 6, 2020

omry commented Nov 6, 2020

This is likely the same problem as #1005.
Can you try to reproduce it without Hydra, configuring the logging in the parent process and not in the spawned processes?

I think the problem is that no one configures logging in the spawned processes: since they are new processes, they do not inherit the logging configuration from the parent process, where hydra.main() configures the logging.
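The symptom matches what an unconfigured logging tree does by default. A minimal stdlib-only sketch (no Hydra or torch) of what a freshly spawned process sees:

```python
import logging

# A freshly spawned process starts with a brand-new, unconfigured root logger:
fresh_root = logging.RootLogger(logging.WARNING)

# The default level is WARNING, so INFO records are dropped outright...
print(fresh_root.isEnabledFor(logging.INFO))     # False
print(fresh_root.isEnabledFor(logging.WARNING))  # True

# ...and with no handlers attached, WARNING/ERROR records fall through to
# logging.lastResort, which writes just the bare message to stderr -- the
# unformatted "Hi! I'm warning from train_loop" lines seen in the repro.
print(fresh_root.handlers)  # []
```

This is consistent with the repro output: the worker's INFO line vanishes, while the WARNING and ERROR lines appear without any formatting.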


ryul99 commented Nov 7, 2020

I tried this code:

import datetime
import logging
import os
import sys
import hydra
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from omegaconf import OmegaConf

root = logging.getLogger()
root.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter(
    "[%(asctime)s][%(name)s][%(levelname)s] - %(message)s"
)
handler.setFormatter(formatter)
root.addHandler(handler)
logger = logging.getLogger(os.path.basename(__name__))
cfg = OmegaConf.load("DDP_conf.yaml")


def setup(cfg, rank):
    os.environ["MASTER_ADDR"] = cfg.dist.master_addr
    os.environ["MASTER_PORT"] = cfg.dist.master_port
    timeout_sec = 1800
    if cfg.dist.timeout is not None:
        os.environ["NCCL_BLOCKING_WAIT"] = "1"
        timeout_sec = cfg.dist.timeout
    timeout = datetime.timedelta(seconds=timeout_sec)

    # initialize the process group
    dist.init_process_group(
        cfg.dist.mode,
        rank=rank,
        world_size=cfg.dist.gpus,
        timeout=timeout,
    )


def cleanup():
    dist.destroy_process_group()


def distributed_run(fn, cfg):
    mp.spawn(fn, args=(cfg,), nprocs=cfg.dist.gpus, join=True)


def train_loop(rank, cfg):
    logger.info("Hi! I'm info from train_loop")
    logger.warning("Hi! I'm warning from train_loop")
    logger.error("Hi! I'm error from train_loop")


# @hydra.main(config_path="DDP_conf.yaml")
def main():
    logger.info("Hi! I'm info from main function")
    logger.warning("Hi! I'm warning from main function")
    logger.error("Hi! I'm error from main function")
    distributed_run(train_loop, cfg)


if __name__ == "__main__":
    main()

and output is...

[2020-11-07 16:21:26,145][__main__][INFO] - Hi! I'm info from main function
[2020-11-07 16:21:26,145][__main__][WARNING] - Hi! I'm warning from main function
[2020-11-07 16:21:26,145][__main__][ERROR] - Hi! I'm error from main function
[2020-11-07 16:21:26,721][__mp_main__][INFO] - Hi! I'm info from train_loop
[2020-11-07 16:21:26,721][__mp_main__][WARNING] - Hi! I'm warning from train_loop
[2020-11-07 16:21:26,721][__mp_main__][ERROR] - Hi! I'm error from train_loop

I also tried using Hydra while configuring the root logger:

import datetime
import logging
import os
import sys
import hydra
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from omegaconf import OmegaConf

logger = logging.getLogger()
# logger = logging.getLogger(os.path.basename(__name__))
# cfg = OmegaConf.load("DDP_conf.yaml")


def setup(cfg, rank):
    os.environ["MASTER_ADDR"] = cfg.dist.master_addr
    os.environ["MASTER_PORT"] = cfg.dist.master_port
    timeout_sec = 1800
    if cfg.dist.timeout is not None:
        os.environ["NCCL_BLOCKING_WAIT"] = "1"
        timeout_sec = cfg.dist.timeout
    timeout = datetime.timedelta(seconds=timeout_sec)

    # initialize the process group
    dist.init_process_group(
        cfg.dist.mode,
        rank=rank,
        world_size=cfg.dist.gpus,
        timeout=timeout,
    )


def cleanup():
    dist.destroy_process_group()


def distributed_run(fn, cfg):
    mp.spawn(fn, args=(cfg,), nprocs=cfg.dist.gpus, join=True)


def train_loop(rank, cfg):
    logger.info("Hi! I'm info from train_loop")
    logger.warning("Hi! I'm warning from train_loop")
    logger.error("Hi! I'm error from train_loop")


@hydra.main(config_path="DDP_conf.yaml")
def main(cfg):
    logger.info("Hi! I'm info from main function")
    logger.warning("Hi! I'm warning from main function")
    logger.error("Hi! I'm error from main function")
    distributed_run(train_loop, cfg)


if __name__ == "__main__":
    main()

and the output is...

/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning: 
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
  warnings.warn(category=UserWarning, message=msg)
[2020-11-07 16:26:46,358][root][INFO] - Hi! I'm info from main function
[2020-11-07 16:26:46,358][root][WARNING] - Hi! I'm warning from main function
[2020-11-07 16:26:46,358][root][ERROR] - Hi! I'm error from main function
Hi! I'm warning from train_loop
Hi! I'm error from train_loop

@ryul99
Copy link
Author

ryul99 commented Nov 7, 2020

I tried to adapt that issue's solution, but this code doesn't work.

import datetime
import logging
import multiprocessing
import os
from logging.handlers import QueueHandler, QueueListener

import hydra
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from omegaconf import OmegaConf


def worker_init(q_list, level):
    logger = logging.getLogger()
    # all records from worker processes go to qh and then into q
    for q in q_list:
        qh = QueueHandler(q)
        logger.setLevel(logging.DEBUG)
        logger.addHandler(qh)


def logger_init(handlers, level):
    ql_list = []
    q_list = []
    for handler in handlers:
        q = multiprocessing.Queue()
        ql = QueueListener(q, handler)
        ql.start()
        q_list.append(q)
        ql_list.append(ql)
        logger = logging.getLogger()
        logger.setLevel(logging.DEBUG)
        # add the handler to the logger so records from this process are handled
        logger.addHandler(handler)
        logger.setLevel(level=level)

    return ql_list, q_list


def distributed_run(fn, a):
    mp.spawn(fn, args=(a,), nprocs=1, join=False)


def train_loop(rank, q_list):
    # setup(cfg, rank)
    logger = logging.getLogger()
    # all records from worker processes go to qh and then into q
    for q in q_list:
        qh = QueueHandler(q)
        logger.setLevel(logging.DEBUG)
        logger.addHandler(qh)
    print('sdfasdfasdfasdf')
    logging.info("Hi! I'm info from train_loop")
    logging.warning("Hi! I'm warning from train_loop")
    logging.error("Hi! I'm error from train_loop")
    print(logger.handlers)
    # cleanup()


@hydra.main()
def main(cfg):
    logger = logging.getLogger()
    q_listener, q = logger_init(logger.handlers, logger.level)
    logger.info("Hi! I'm info from main function")
    logger.warning("Hi! I'm warning from main function")
    logger.error("Hi! I'm error from main function")
    print(logger.handlers)
    distributed_run(train_loop, q)
    for ql in q_listener:
        ql.stop()


if __name__ == "__main__":
    main()

It prints:

[2020-11-07 19:39:24,268][root][INFO] - Hi! I'm info from main function
[2020-11-07 19:39:24,269][root][WARNING] - Hi! I'm warning from main function
[2020-11-07 19:39:24,269][root][ERROR] - Hi! I'm error from main function
[<StreamHandler <stdout> (NOTSET)>, <FileHandler /home/ryul99/Workspace/pytorch-project-template/outputs/2020-11-07/19-39-24/DDP.log (NOTSET)>]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ryul99/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ryul99/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/ryul99/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory


SunQpark commented Nov 9, 2020

@ryul99 when Hydra initializes, it configures the Python logging module (with the Hydra default or user-provided configs). In that case, just calling logging.getLogger() returns a properly configured logger that works fine (as in your main function).

However, logging in the training subprocesses (spawned by mp.spawn) is not configured, since Hydra is not initialized in those subprocesses. My solution was to make a util function that initializes the logging module:

import logging.config

import torch.distributed as dist
from omegaconf import OmegaConf


def is_master():
    return not dist.is_initialized() or dist.get_rank() == 0

def get_logger(name=None):
    if is_master():
        # TODO: also configure logging for sub-processes(not master)
        hydra_conf = OmegaConf.load('.hydra/hydra.yaml')
        logging.config.dictConfig(OmegaConf.to_container(hydra_conf.hydra.job_logging, resolve=True))
    return logging.getLogger(name)

def train_worker(config):
    logger = get_logger('train') # this should be a local variable
    # setup data_loader instances
    ...

By the way, I'm currently also working on enabling Hydra and DDP support for a pytorch project template. You can check this branch for my full implementation for DDP.


ryul99 commented Nov 10, 2020

@SunQpark Thank you for your answer! I'll take a look.


omry commented Nov 10, 2020

Thanks @SunQpark.
Instead of reading the Hydra config from the file system, try using the HydraConfig singleton to access the Hydra configuration, like here.
I suspect this will not work out of the box, because the singletons need to be initialized in the spawned process.

You can do that by calling this in the main process and passing the object down to the spawned function:

singleton_state = Singleton.get_state()

And then initializing the singletons from that state in the spawned process's function:

Singleton.set_state(singleton_state)
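The shape of this hand-off can be sketched with a stdlib-only stand-in (this is not Hydra's actual Singleton class, just the pattern: capture a picklable state dict in the parent, pass it through spawn, restore it in the worker):

```python
# Minimal stand-in for a singleton registry with get_state / set_state.
class Singleton:
    _instances = {}

    @classmethod
    def instance(cls, key, factory):
        # Return the registered instance, creating it on first use.
        if key not in cls._instances:
            cls._instances[key] = factory()
        return cls._instances[key]

    @classmethod
    def get_state(cls):
        # A plain dict, so it can be pickled and passed to mp.spawn workers.
        return dict(cls._instances)

    @classmethod
    def set_state(cls, state):
        cls._instances = dict(state)


# Parent process: populate the registry and capture its state.
Singleton.instance("config", lambda: {"lr": 0.001})
state = Singleton.get_state()

# Worker process (simulated): starts empty, restores from the passed state.
Singleton._instances = {}
Singleton.set_state(state)
print(Singleton.instance("config", dict))  # {'lr': 0.001}
```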


ryul99 commented Nov 10, 2020

I solved this issue this way (ryul99/pytorch-project-template@3f09593):

import logging.config
import os.path as osp

from omegaconf import OmegaConf

# is_logging_process() is a helper defined elsewhere in the repo


def get_logger(cfg, name=None, log_file_path=None):
    # log_file_path is used when unit testing
    if is_logging_process():
        project_root_path = osp.dirname(osp.dirname(osp.abspath(__file__)))
        hydra_conf = OmegaConf.load(osp.join(project_root_path, "config/default.yaml"))

        job_logging_name = None
        for job_logging_name in hydra_conf.defaults:
            if isinstance(job_logging_name, dict):
                job_logging_name = job_logging_name.get("hydra/job_logging")
                if job_logging_name is not None:
                    break
            job_logging_name = None
        if job_logging_name is None:
            job_logging_name = "custom"  # default name

        logging_conf = OmegaConf.load(
            osp.join(
                project_root_path,
                "config/hydra/job_logging",
                job_logging_name + ".yaml",
            )
        )
        if log_file_path is not None:
            logging_conf.handlers.file.filename = log_file_path
        logging.config.dictConfig(OmegaConf.to_container(logging_conf, resolve=True))
    return logging.getLogger(name)

As I understand it, @SunQpark's solution is to set up the main process's Hydra logger via a custom config and then load that config manually in the subprocesses.
@omry Oh, that's a nice way! But I had trouble with Hydra reading the config located in hydra/job_logging/custom.yaml directly; I can't access that config (I think Hydra conceals the configs that form Hydra itself). I also tried loading an outer config from hydra/job_logging/custom.yaml, but that failed too. I haven't tried the HydraConfig singleton yet, but I think that approach would have the same issue. Do you have any ideas?


I finally understand what omry said 😅. I can access Hydra's config by using the singleton!

@SunQpark

Thank you for your suggestion, @omry.
I tried that solution, but failed to pass singleton state to the spawned processes.
Singleton state seems to be a non-picklable object which can't be passed through torch.multiprocessing.spawn.

@jieru-hu
Contributor

> Thank you for your suggestion, @omry.
> I tried that solution, but failed to pass singleton state to the spawned processes.
> Singleton state seems to be a non-picklable object which can't be passed through torch.multiprocessing.spawn.

Could you share what errors you get when trying to pickle? We actually pickle the singleton state in one of our plugins.


omry commented Nov 11, 2020

It should be picklable. Can you provide a minimal repro of what you are seeing? (Also provide your Python version and pip freeze output.)

@SunQpark

1. Minimal Reproducing Example

I used the following code in three different environments:
envs 1 and 2 are Docker instances with different Python versions, and env 3 is my local machine (a MacBook Pro).

import sys
import hydra
import pickle as pkl
from hydra.core.singleton import Singleton


@hydra.main(config_path='conf/', config_name='train')
def pickle_state(_):
    state = Singleton.get_state()
    pkl.dumps(state)

if __name__ == '__main__':
    print(sys.version)
    print(hydra.__version__)
    # pylint: disable=no-value-for-parameter
    pickle_state()

2. Results

output from env 1.

3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
[GCC 7.3.0]
1.0.3
Traceback (most recent call last):
  File "repro_states.py", line 10, in pickle_state
    pkl.dumps(state)
_pickle.PicklingError: Can't pickle typing.List[str]: it's not the same object as typing.List

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

output from env 2.

3.7.7 (default, May  7 2020, 21:25:33)
[GCC 7.3.0]
1.0.3
Traceback (most recent call last):
  File "repro_states.py", line 10, in pickle_state
    pkl.dumps(state)
AttributeError: Can't pickle local object 'OmegaConf.register_resolver.<locals>.caching'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

output from env 3.

3.8.6 (default, Oct  8 2020, 14:06:32)
[Clang 12.0.0 (clang-1200.0.32.2)]
1.0.3
Traceback (most recent call last):
  File "repro_states.py", line 10, in pickle_state
    pkl.dumps(state)
AttributeError: Can't pickle local object 'OmegaConf.register_resolver.<locals>.caching'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
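The env 2 and env 3 failures are an instance of a generic CPython limitation: locally defined functions (closures) cannot be pickled, because pickle stores functions by qualified name and cannot look a `<locals>` name back up at load time. A minimal sketch of the same error shape (`register`/`caching` are hypothetical stand-ins for OmegaConf's resolver registration):

```python
import pickle


def register(fn):
    def caching(*args):  # a local closure, like register_resolver.<locals>.caching
        return fn(*args)
    return caching


resolver = register(len)
try:
    pickle.dumps(resolver)
    failed = False
except (AttributeError, pickle.PicklingError):
    # e.g. "Can't pickle local object 'register.<locals>.caching'"
    failed = True
print(failed)  # True
```

So any state that holds a reference to such a closure (as the registered resolvers apparently do here) becomes non-picklable as a whole.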

3. pip freeze results

env1

absl-py==0.10.0
aiohttp==3.6.2
aioredis==1.3.1
albumentations==0.4.0
alembic==1.0.11
antlr4-python3-runtime==4.8
appdirs==1.4.3
argon2-cffi==20.1.0
asciimatics==1.11.0
asn1crypto==0.24.0
async-generator==1.10
async-timeout==3.0.1
attrs==20.2.0
backcall==0.1.0
bc-dvc-init==0.3.0
beautifulsoup4==4.9.1
bleach==3.2.0
blessings==1.7
boto3==1.9.115
botocore==1.12.207
cachetools==4.1.1
certifi==2019.6.16
cffi==1.12.3
chardet==3.0.4
Click==7.0
cloudpickle==1.2.1
colorama==0.4.1
colorful==0.5.4
conda==4.7.11
conda-package-handling==1.3.11
configobj==5.0.6
configparser==3.8.1
contextlib2==0.5.5
contextvars==2.4
cryptography==2.7
cycler==0.10.0
Cython==0.29.12
databricks-cli==0.8.7
dataclasses==0.7
ddt==1.2.1
decorator==4.4.0
defusedxml==0.6.0
distro==1.4.0
docker==4.0.2
docutils==0.14
dvc==0.54.1
entrypoints==0.3
filelock==3.0.12
fire==0.2.1
Flask==1.1.1
fonttools==4.0.2
freetype-py==2.1.0.post1
funcy==1.13
future==0.17.1
git-url-parse==1.2.2
gitdb==0.6.4
gitdb2==2.0.5
GitPython==3.0.0
google==3.0.0
google-api-core==1.22.2
google-auth==1.21.1
google-auth-oauthlib==0.4.1
googleapis-common-protos==1.52.0
gorilla==0.3.0
gpustat==0.6.0
grandalf==0.6
grpcio==1.31.0
gunicorn==19.9.0
hiredis==1.1.0
humanize==0.5.1
hydra-core==1.0.3
idna==2.8
idna-ssl==1.1.0
imageio==2.6.1
imgaug==0.2.6
immutables==0.14
importlib-metadata==1.7.0
importlib-resources==3.0.0
inflect==2.1.0
ipykernel==5.3.4
ipython==7.7.0
ipython-genutils==0.2.0
itsdangerous==1.1.0
jedi==0.13.3
Jinja2==2.10.1
jmespath==0.9.4
json5==0.9.5
jsonpath-ng==1.4.3
jsonschema==3.2.0
jupyter-client==6.1.7
jupyter-core==4.6.3
jupyterlab==2.2.8
jupyterlab-pygments==0.1.1
jupyterlab-server==1.2.0
kiwisolver==1.1.0
libarchive-c==2.8
lmdb==0.97
Mako==1.1.0
Markdown==3.2.2
MarkupSafe==1.1.1
matplotlib==3.1.1
mistune==0.8.4
mkl-fft==1.0.12
mkl-random==1.0.2
mkl-service==2.0.2
mlflow==1.2.0
msgpack==1.0.0
multidict==4.7.6
nanotime==0.5.2
nbclient==0.5.0
nbconvert==6.0.3
nbformat==5.0.7
nest-asyncio==1.4.0
networkx==2.3
notebook==6.1.4
numpy==1.16.4
nvidia-ml-py3==7.352.0
oauthlib==3.1.0
olefile==0.46
omegaconf==2.0.2
opencensus==0.7.10
opencensus-context==0.1.1
opencv-python==4.1.2.30
opencv-python-headless==4.1.1.26
packaging==20.4
pandas==0.25.0
pandocfilters==1.4.2
parmap==1.5.2
parso==0.5.0
pathspec==0.5.9
pbr==5.4.2
pexpect==4.7.0
pickleshare==0.7.5
Pillow==6.2.1
Pillow-SIMD==6.0.0.post0
ply==3.11
Polygon3==3.0.8
prometheus-client==0.8.0
prompt-toolkit==2.0.9
protobuf==3.9.1
psutil==5.7.2
ptyprocess==0.6.0
py-spy==0.3.3
pyarrow==1.0.1
pyasn1==0.4.6
pyasn1-modules==0.2.8
pycosat==0.6.3
pycparser==2.19
pyfiglet==0.8.post1
Pygments==2.4.2
pyOpenSSL==19.0.0
pyparsing==2.4.2
pyrsistent==0.16.0
PySocks==1.7.0
python-dateutil==2.8.0
python-editor==1.0.4
pytorch-ranger==0.1.1
pytz==2019.2
PyWavelets==1.0.3
PyYAML==5.1.1
pyzmq==19.0.2
querystring-parser==1.2.4
ray==0.8.7
redis==3.4.1
requests==2.22.0
requests-oauthlib==1.3.0
rsa==4.6
ruamel-yaml==0.15.46
s3transfer==0.2.1
schema==0.7.0
scikit-image==0.16.1
scipy==1.3.0
Send2Trash==1.5.0
shortuuid==0.5.0
simplejson==3.16.0
six==1.12.0
smmap==0.9.0
smmap2==2.0.5
soupsieve==2.0.1
SQLAlchemy==1.3.6
sqlparse==0.3.0
tabulate==0.8.3
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
torch==1.2.0
torch-optimizer==0.0.1a16
torchvision==0.4.0
tornado==6.0.4
tqdm==4.32.1
traitlets==4.3.2
treelib==1.5.5
typing==3.6.4
typing-extensions==3.7.4.3
urllib3==1.24.2
wcwidth==0.1.7
webencodings==0.5.1
websocket-client==0.56.0
Werkzeug==0.15.5
yarl==1.5.1
zc.lockfile==2.0
zipp==3.1.0

env2

absl-py==0.10.0
aiohttp==3.7.2
aiohttp-cors==0.7.0
aioredis==1.3.1
albumentations==0.5.0
alembic==1.4.1
antlr4-python3-runtime==4.8
appdirs==1.4.4
apted==1.0.3
async-timeout==3.0.1
atpublic==2.0
attrs==20.2.0
azure-core==1.8.1
azure-storage-blob==12.4.0
backcall==0.2.0
bc-dvc-init==0.3.0
beautifulsoup4==4.9.1
blessings==1.7
boto3==1.14.57
botocore==1.17.57
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.0
chardet==3.0.4
click==7.1.2
cloudpickle==1.6.0
colorama==0.4.3
colorful==0.5.4
commonmark==0.9.1
conda==4.8.3
conda-build==3.18.11
conda-package-handling==1.7.0
configobj==5.0.6
cryptography==2.9.2
cycler==0.10.0
databricks-cli==0.11.0
decorator==4.4.2
dictdiffer==0.8.1
dill==0.3.2
diskcache==5.0.3
Distance==0.1.3
distro==1.5.0
docker==4.3.1
docutils==0.15.2
dpath==2.0.1
dvc==1.6.6
entrypoints==0.3
filelock==3.0.12
Flask==1.1.2
flatten-dict==0.3.0
flufl.lock==3.2
funcy==1.14
future==0.18.2
git-url-parse==1.2.2
gitdb==4.0.5
GitPython==3.1.8
glob2==0.7
google==3.0.0
google-api-core==1.23.0
google-auth==1.22.1
google-auth-oauthlib==0.4.1
googleapis-common-protos==1.52.0
gorilla==0.3.0
gpustat==0.6.0
grandalf==0.6
grpcio==1.32.0
gunicorn==20.0.4
hiredis==1.1.0
hydra-core==1.0.3
idna==2.9
imageio==2.9.0
imgaug==0.4.0
importlib-metadata==2.0.0
importlib-resources==3.0.0
ipython @ file:///tmp/build/80754af9/ipython_1593447368578/work
ipython-genutils==0.2.0
isodate==0.6.0
itsdangerous==1.1.0
jedi @ file:///tmp/build/80754af9/jedi_1592841891421/work
Jinja2==2.11.2
jmespath==0.10.0
jsonpath-ng==1.5.2
jsonschema==3.2.0
kiwisolver==1.2.0
libarchive-c==2.9
lmdb==1.0.0
lxml==4.6.1
Mako==1.1.3
Markdown==3.3.2
MarkupSafe @ file:///tmp/build/80754af9/markupsafe_1594371495811/work
matplotlib==3.3.2
mkl-fft==1.1.0
mkl-random==1.1.1
mkl-service==2.3.0
mlflow==1.11.0
msgpack==1.0.0
msrest==0.6.19
multidict==5.0.0
nanotime==0.5.2
networkx==2.4
numpy==1.18.5
nvidia-ml-py3==7.352.0
oauthlib==3.1.0
olefile==0.46
omegaconf==2.0.3
opencensus==0.7.11
opencensus-context==0.1.2
opencv-python==4.4.0.44
opencv-python-headless==4.4.0.44
packaging==20.4
pandas==1.1.2
parso==0.7.0
pathlib2==2.3.5
pathspec==0.8.0
pbr==5.5.0
petastorm==0.9.6
pexpect @ file:///tmp/build/80754af9/pexpect_1594383317248/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1594384075987/work
Pillow==8.0.0
Pillow-SIMD==7.0.0.post3
pkginfo==1.5.0.1
ply==3.11
Polygon3==3.0.8
prometheus-client==0.8.0
prometheus-flask-exporter==0.17.0
prompt-toolkit==3.0.5
protobuf==3.13.0
psutil==5.7.0
ptyprocess==0.6.0
py-spy==0.3.3
py4j==0.10.9
pyarrow==2.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycosat==0.6.3
pycparser==2.20
pydot==1.4.1
Pygments==2.6.1
pygtrie==2.3.2
pyOpenSSL==19.1.0
pyparsing==2.4.7
pyrsistent==0.17.3
PySocks==1.7.1
pyspark==3.0.1
python-dateutil==2.8.1
python-editor==1.0.4
pytorch-ranger==0.1.1
pytz==2020.1
PyWavelets==1.1.1
PyYAML==5.3.1
pyzmq==19.0.2
querystring-parser==1.2.4
ray==1.0.0
redis==3.4.1
requests==2.24.0
requests-oauthlib==1.3.0
rich==6.1.1
rsa==4.6
ruamel-yaml==0.15.87
ruamel.yaml.clib==0.2.2
s3transfer==0.3.3
scikit-image==0.17.2
scipy==1.5.3
Shapely==1.7.1
shortuuid==1.0.1
shtab==1.3.1
six==1.14.0
smmap==3.0.4
soupsieve==2.0.1
SQLAlchemy==1.3.13
sqlparse==0.3.1
tabulate==0.8.7
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tifffile==2020.10.1
toml==0.10.1
torch==1.6.0
torch-optimizer==0.0.1a16
torchvision==0.7.0
tqdm==4.46.0
traitlets==4.3.3
typing-extensions==3.7.4.3
urllib3==1.25.8
voluptuous==0.11.7
wcwidth @ file:///tmp/build/80754af9/wcwidth_1593447189090/work
websocket-client==0.57.0
Werkzeug==1.0.1
yarl==1.6.2
zc.lockfile==2.0
zipp==3.3.1

env3

alabaster==0.7.12
albumentations==0.4.5
alembic==1.4.1
altair==4.1.0
amqp==2.5.2
anaconda-client==1.7.2
anaconda-navigator==1.9.7
anaconda-project==0.8.3
appdirs==1.4.3
appnope==0.1.0
appscript==1.0.1
asn1crypto==1.0.1
aspy.yaml==1.3.0
astor==0.8.1
astroid==2.3.1
astropy==3.2.2
atomicwrites==1.3.0
attrs==19.2.0
Babel==2.7.0
backcall==0.1.0
backports.functools-lru-cache==1.5
backports.os==0.1.1
backports.shutil-get-terminal-size==1.0.0
backports.tempfile==1.0
backports.weakref==1.0.post1
base58==2.0.0
beautifulsoup4==4.8.0
billiard==3.6.1.0
bitarray==1.0.1
bkcharts==0.2
black==19.10b0
bleach==3.1.0
blinker==1.4
bokeh==1.3.4
boto==2.49.0
boto3==1.14.2
botocore==1.17.2
Bottleneck==1.2.1
braincloud==0.1.0
cachetools==4.1.0
celery==4.3.0
certifi==2019.9.11
cffi==1.12.3
cfgv==2.0.1
chardet==3.0.4
Click==7.0
cloudpickle==1.2.2
clyent==1.2.2
colorama==0.4.1
conda==4.8.1
conda-build==3.18.9
conda-package-handling==1.6.0
conda-verify==3.4.2
configparser==5.0.0
contextlib2==0.6.0
cryptography==2.7
cycler==0.10.0
Cython==0.29.13
cytoolz==0.10.0
dask==2.5.2
databricks-cli==0.10.0
decorator==4.4.0
defusedxml==0.6.0
disjoint-set==0.6.3
distributed==2.5.2
docker==4.2.0
docutils==0.15.2
entrypoints==0.3
enum-compat==0.0.3
et-xmlfile==1.0.1
fastcache==1.1.0
filelock==3.0.12
fire==0.2.1
flake8==3.7.9
Flask==1.1.1
Flask-Cors==3.0.8
fsspec==0.5.2
future==0.17.1
gevent==1.4.0
gitdb==4.0.2
GitPython==3.1.0
glob2==0.7
gmpy2==2.0.8
gorilla==0.3.0
greenlet==0.4.15
gunicorn==20.0.4
h5py==2.9.0
HeapDict==1.0.1
html5lib==1.0.1
hydra-core==1.0.0rc1
identify==1.4.7
idna==2.8
ImageHash==4.1.0
imageio==2.6.0
imagesize==1.1.0
imgaug==0.2.6
importlib-metadata==0.23
ipykernel==5.1.2
ipython==7.8.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
isort==4.3.21
itsdangerous==1.1.0
jdcal==1.4.1
jedi==0.15.1
Jinja2==2.10.3
jmespath==0.10.0
joblib==0.13.2
json5==0.8.5
jsonschema==3.0.2
jupyter==1.0.0
jupyter-client==5.3.3
jupyter-console==6.0.0
jupyter-core==4.5.0
jupyterlab==1.1.4
jupyterlab-server==1.0.6
keyring==18.0.0
kiwisolver==1.1.0
kombu==4.6.6
lazy-object-proxy==1.4.2
libarchive-c==2.8
lief==0.9.0
llvmlite==0.29.0
lmdb==0.98
locket==0.2.0
lxml==4.4.1
Mako==1.1.2
MarkupSafe==1.1.1
matplotlib==3.1.1
mccabe==0.6.1
mistune==0.8.4
mkl-fft==1.0.14
mkl-random==1.1.0
mkl-service==2.3.0
mlflow==1.7.2
mock==3.0.5
more-itertools==7.2.0
mpmath==1.1.0
msgpack==0.6.1
multipledispatch==0.6.0
navigator-updater==0.2.1
nbconvert==5.6.0
nbformat==4.4.0
networkx==2.3
nltk==3.4.5
nodeenv==1.3.3
nose==1.3.7
notebook==6.0.1
numba==0.45.1
numexpr==2.7.0
numpy==1.17.2
numpydoc==0.9.1
olefile==0.46
omegaconf==2.0.1rc6
opencv-python-headless==4.1.1.26
openpyxl==3.0.0
packaging==19.2
pandas==0.25.1
pandocfilters==1.4.2
parmap==1.5.2
parso==0.5.1
partd==1.0.0
path.py==12.0.1
pathlib2==2.3.5
pathspec==0.6.0
pathtools==0.1.2
patsy==0.5.1
pep8==1.7.1
pexpect==4.7.0
pickleshare==0.7.5
Pillow==6.2.0
pkginfo==1.5.0.1
pluggy==0.13.0
ply==3.11
pre-commit==1.20.0
prometheus-client==0.7.1
prometheus-flask-exporter==0.13.0
prompt-toolkit==2.0.10
protobuf==3.11.3
psutil==5.6.3
ptyprocess==0.6.0
py==1.8.0
pyaes==1.6.1
pycodestyle==2.5.0
pycosat==0.6.3
pycparser==2.19
pycrypto==2.6.1
pycurl==7.43.0.3
pydeck==0.3.1
pyflakes==2.1.1
Pygments==2.4.2
pylint==2.4.4
pyobjc-core==6.1
pyobjc-framework-Cocoa==6.1
pyobjc-framework-Security==6.1
pyodbc==4.0.27
pyOpenSSL==19.0.0
pyparsing==2.4.2
pyrsistent==0.15.4
PySocks==1.7.1
pytest==5.2.1
pytest-arraydiff==0.3
pytest-astropy==0.5.0
pytest-doctestplus==0.4.0
pytest-openfiles==0.4.0
pytest-remotedata==0.3.2
python-dateutil==2.8.1
python-editor==1.0.4
pytz==2019.3
PyWavelets==1.0.3
PyYAML==5.1.2
pyzmq==18.1.0
QtAwesome==0.6.0
qtconsole==4.5.5
QtPy==1.9.0
querystring-parser==1.2.4
regex==2019.11.1
requests==2.22.0
rope==0.14.0
ruamel-yaml==0.15.46
s3transfer==0.3.3
scikit-image==0.15.0
scikit-learn==0.21.3
scipy==1.3.1
seaborn==0.9.0
Send2Trash==1.5.0
Shapely==1.7.1
simplegeneric==0.8.1
simplejson==3.17.0
singledispatch==3.4.0.3
six==1.15.0
smmap==3.0.1
snowballstemmer==2.0.0
sortedcollections==1.1.2
sortedcontainers==2.1.0
soupsieve==1.9.3
Sphinx==2.2.0
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
sphinxcontrib-websupport==1.1.2
spyder==3.3.6
spyder-kernels==0.5.2
SQLAlchemy==1.3.9
sqlparse==0.3.1
statsmodels==0.10.1
streamlit==0.60.0
sympy==1.4
tables==3.5.2
tabulate==0.8.7
tblib==1.4.0
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
toml==0.10.0
toolz==0.10.0
torch==1.5.0
torchvision==0.6.0
tornado==5.1.1
tqdm==4.44.1
traitlets==4.3.3
typed-ast==1.4.0
typing-extensions==3.7.4.2
tzlocal==2.1
unicodecsv==0.14.1
urllib3==1.24.2
validators==0.15.0
vine==1.3.0
virtualenv==16.7.8
watchdog==0.10.2
wcwidth==0.1.7
webencodings==0.5.1
websocket-client==0.57.0
Werkzeug==0.16.0
widgetsnbextension==3.5.1
wrapt==1.11.2
wurlitzer==1.0.3
xlrd==1.2.0
XlsxWriter==1.2.1
xlwings==0.15.10
xlwt==1.3.0
zict==1.0.0
zipp==0.6.0

@jieru-hu (Contributor) commented Nov 12, 2020

Thx for the repro! I was able to reproduce the stack trace locally.

Could you use cloudpickle instead of pickle? That should solve the issue. I swapped in the library and was able to pickle the object without any problem.

import cloudpickle as pkl
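For context on why swapping the library helps: stdlib pickle serializes functions by qualified name, so lambdas and closures (which often end up in the state passed to spawned processes) cannot be pickled, while cloudpickle serializes them by value. A minimal stdlib-only sketch of the failure mode, written so it runs without cloudpickle installed:

```python
import pickle

# A lambda has no importable qualified name, so stdlib pickle rejects it.
fn = lambda x: x + 1

try:
    pickle.dumps(fn)
    lambda_picklable = True
except Exception:  # pickle.PicklingError on CPython
    lambda_picklable = False

# Plain data structures, by contrast, always pickle fine.
data_blob = pickle.dumps({"lr": 0.01, "epochs": 10})
```

cloudpickle's `dumps` would succeed on `fn` where `pickle.dumps` raises, which is exactly the difference that matters when worker state captures callables.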

@SunQpark commented Nov 12, 2020

@jieru-hu cloudpickle works fine in all environments!
However, I got the following error on Python 3.6 before updating cloudpickle from 1.2.1 to 1.6.0 (it works fine after the update).

3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
[GCC 7.3.0]
1.0.3
Traceback (most recent call last):
  File "repro_states.py", line 10, in pickle_state
    pkl.dumps(state)
  File "/opt/conda/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 1108, in dumps
    cp.dump(obj)
  File "/opt/conda/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 473, in dump
    return Pickler.dump(self, obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 547, in save_function
    return self.save_function_tuple(obj)
  File "/opt/conda/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 747, in save_function_tuple
    save(state)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 781, in save_list
    self._batch_appends(obj)
  File "/opt/conda/lib/python3.6/pickle.py", line 805, in _batch_appends
    save(x)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 547, in save_function
    return self.save_function_tuple(obj)
  File "/opt/conda/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 747, in save_function_tuple
    save(state)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.6/pickle.py", line 507, in save
    self.save_global(obj, rv)
  File "/opt/conda/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 859, in save_global
    Pickler.save_global(self, obj, name=name)
  File "/opt/conda/lib/python3.6/pickle.py", line 927, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle typing.Union[str, NoneType]: it's not the same object as typing.Union

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

@SunQpark commented Nov 12, 2020

Is there something I can try to make torch.multiprocessing.spawn use cloudpickle instead of pickle internally?
I think pickling the singleton state into bytes before passing it to the spawn method will work, but there might be a better (single-step) solution.

@jieru-hu (Contributor)
Thanks for letting me know, @SunQpark.
The pickling issue with Python 3.6 is a known one: #428

Is there something I can try, to make torch.multiprocessing.spawn use cloudpickle instead of pickle internally??

Unfortunately, I'm not sure if there's an easier solution here.

@omry (Collaborator) commented Nov 12, 2020

@SunQpark, I suggest that you do not use Python 3.6; it will be discontinued in about a year, and I can't think of a reason to prefer it over a newer version.

@omry (Collaborator) commented Nov 12, 2020

Is there something I can try, to make torch.multiprocessing.spawn use cloudpickle instead of pickle internally??

PyTorch does not depend on cloudpickle, so I don't think there is a clean way to get it to use it.
Your proposed idea will work: serialize with cloudpickle yourself and pass the result as a string.
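The workaround suggested here can be sketched as follows. The names (`pack_state`, `unpack_state`, `worker`) are illustrative, and stdlib pickle stands in for cloudpickle so the sketch runs without extra dependencies; in real code you would import cloudpickle instead:

```python
import pickle  # in real code: import cloudpickle as pickle

def pack_state(state):
    # Serialize singleton state in the parent before calling
    # torch.multiprocessing.spawn; opaque bytes pass through spawn untouched.
    return pickle.dumps(state)

def unpack_state(blob):
    # First thing each spawned worker does: restore the state.
    return pickle.loads(blob)

def worker(rank, blob):
    state = unpack_state(blob)
    # ... reconfigure logging / Hydra singletons from `state`, then train ...
    return state

# Parent side (sketch, not executed here):
# torch.multiprocessing.spawn(worker, args=(pack_state(state),), nprocs=ngpus)
```

Because the child only ever sees `bytes`, the stdlib pickler used internally by `torch.multiprocessing.spawn` never has to serialize the problematic objects itself.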

@briankosw (Contributor)

The way I see it, this is fundamentally due to the nature of torch.multiprocessing.spawn (Python's multiprocessing spawn, to be more specific) and how program state is passed to the newly spawned processes. The best solution would be to "hijack" the spawning mechanism so that Hydra is automatically configured in the newly spawned process (or perhaps a hydra.log?), but otherwise the get_logger methods by @SunQpark and @ryul99 look like good solutions.

@ryul99 (Author) commented Nov 16, 2020

Finally, I solved this issue in the following way.
Define a get_logger function
In get_logger, the logger is configured with the same settings as Hydra's job_logging config:
https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/utils/utils.py#L22-L28

def get_logger(cfg, name=None):
    # log_file_path is used when unit testing
    if is_logging_process():
        logging.config.dictConfig(
            OmegaConf.to_container(cfg.job_logging_cfg, resolve=True)
        )
        return logging.getLogger(name)

Add Hydra's job logging config to the user's config in trainer.py:
https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/trainer.py#L143-L144

@hydra.main(config_path="config", config_name="default")
def main(hydra_cfg):
    hydra_cfg.device = hydra_cfg.device.lower()
    with open_dict(hydra_cfg):
        hydra_cfg.job_logging_cfg = HydraConfig.get().job_logging

In unit tests
In a unit test, I failed to load Hydra's job logging config with HydraConfig.get(), so I used @SunQpark's approach:
https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/tests/test_case.py#L33-L58

        # load job_logging_cfg
        project_root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
        hydra_conf = OmegaConf.load(
            os.path.join(project_root_path, "config/default.yaml")
        )
        job_logging_name = None
        for job_logging in hydra_conf.defaults:
            job_logging_name = job_logging.get("hydra/job_logging")
            if job_logging_name is not None:
                break
        job_logging_cfg_path = os.path.join(
            project_root_path,
            "config/hydra/job_logging",
            str(job_logging_name) + ".yaml",
        )
        if os.path.exists(job_logging_cfg_path):
            job_logging_cfg = OmegaConf.load(job_logging_cfg_path)
        else:
            job_logging_cfg = dict()
        with open_dict(self.cfg):
            self.cfg.job_logging_cfg = job_logging_cfg
        self.cfg.job_logging_cfg.handlers.file.filename = str(
            (self.working_dir / "trainer.log").resolve()
        )
        # set logger
        self.logger = get_logger(self.cfg, os.path.basename(__file__))

Config file
This approach works because we configure the same logger (file and stdout) in both the main process (by Hydra) and the subprocesses (by get_logger).
This means you should keep hydra/job_logging in your custom config, because the subprocesses need to know how to configure the logger:
https://github.com/ryul99/pytorch-project-template/tree/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/config
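A minimal sketch of the subprocess side of this scheme, assuming a hand-written job-logging dict whose format string mirrors Hydra's default; the handler name and the config contents are illustrative, not Hydra's exact configuration:

```python
import logging
import logging.config

# Illustrative stand-in for the job_logging config that Hydra applies
# in the main process (and that ryul99 stashes into cfg.job_logging_cfg).
JOB_LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {
            "format": "[%(asctime)s][%(name)s][%(levelname)s] - %(message)s"
        }
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        }
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}

def configure_subprocess_logging(cfg=JOB_LOGGING, name=None):
    # Called at the top of each spawned worker so its root logger matches
    # the main process; dictConfig replaces any default (empty) config.
    logging.config.dictConfig(cfg)
    return logging.getLogger(name)
```

Once each worker calls this on startup, log records emitted in the subprocess are formatted and routed the same way as in the Hydra-configured main process, which is why the output appears again.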

@omry (Collaborator) commented Nov 16, 2020

@briankosw, can you try to hack up a prototype PR attempting this?
Something like this might conflict with the joblib plugin.

@briankosw (Contributor)

I'll create a separate issue to address this.

@trias702

Just want to say that the official commit to solve this (facebookresearch/fairseq@9a1c497) does not appear to work when replicating the wav2vec2 Hydra scripts, and throws the following (at least for me):

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fairseq_cli/hydra_train.py", line 34, in hydra_main
    cfg.job_logging_cfg = OmegaConf.to_container(HydraConfig.get().job_logging, resolve=True)
omegaconf.errors.ConfigKeyError: str interpolation key 'hydra.job.name' not found

There seem to be issues resolving ${hydra.job.name} for some reason.

@omry (Collaborator) commented Jan 27, 2021

This is an issue with 1.0.5 that will be addressed in 1.0.6.
Downgrade to Hydra 1.0.4 for now, or try installing 1.0.6 from the 1.0_branch. (In fact, I would be happy for someone to verify that the fix in that branch really solves the issue; please let me know otherwise.)

@trias702

Thank you!

Confirmed working correctly after downgrading to 1.0.4

@omry (Collaborator) commented Jan 27, 2021

Thank you!

Confirmed working correctly after downgrading to 1.0.4

This much I already knew. Can you also test 1.0.6 from the repo?

@trias702

Yes, but it will need to be tomorrow, as I just started a long-running wav2vec fit using 1.0.4. As soon as it finishes, I'll test 1.0.6 from the repo and report back.

@trias702

I was able to test the 1.0_branch, and I can confirm that the bug is fixed there. However, there is no 1.0.6 tag in that branch (it only goes up to 1.0.5), even though you kept telling me to test 1.0.6.

I used the following command to checkout:

git clone --single-branch --branch 1.0_branch https://github.com/facebookresearch/hydra.git

And then I built and installed the wheel from there. It gave me version 1.0.5 for this wheel. I then ran my hydra train code and I did NOT get the ${hydra.job.name} error from before.

So it does indeed appear to be fixed in that branch.

@omry (Collaborator) commented Jan 28, 2021

Thanks for testing.
Sorry for the confusion: 1.0.6 is not yet released, so there is no tag and the version in the repo is still 1.0.5.
Your installation procedure was fine; you can also skip building a wheel and just pip install the checkout directory itself (or even skip the checkout and install straight from GitHub, but I never remember the syntax for it).

@omry (Collaborator) commented Feb 18, 2021

I am operating under the assumption that this is a dup of #1005.

omry closed this as completed on Feb 18, 2021
@omry (Collaborator) commented Feb 18, 2021

After looking at #1005, I am not sure anymore.
Regardless, this issue has become a hub of unrelated discussions, and I could not find a minimal repro.
Happy to look at this if someone produces a minimal example using DDP with Hydra that suffers from logging issues.
It should be minimal (a file or two, see the example in #1005 for a good minimal repro).
If you can produce it, please file a new issue.
