
[build] build against installed cuda-11.1 while torch built w/ cuda-11.0 #570

Merged: 5 commits merged on Dec 3, 2020

Conversation

stas00 (Collaborator) commented Dec 2, 2020

I learned this from nvidia apex: building against the installed cuda-11.1 works even though torch was built with cuda-11.0, as the API is similar (identical?). Note that tensorflow requires cuda-11.1 to work with rtx-30* cards. So while I do have both 11.0 and 11.1 installed, the builder can't find 11.0 automatically.

We can probably remove this once cuda-11.2 comes out and pytorch fully supports Ampere - until then pytorch can't be built with cuda-11.1.
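
To make the idea concrete, here is a minimal sketch of this kind of relaxed check - not the actual DeepSpeed builder code, and `installed_cuda_version()` / `check_cuda_compat()` are hypothetical names - where only a major-version mismatch between torch's CUDA and the system nvcc is treated as fatal, while a minor mismatch (11.0 vs 11.1) just warns:

```python
# Hypothetical sketch, not DeepSpeed's actual version check: tolerate a CUDA
# minor-version mismatch (torch built with 11.0, nvcc 11.1) and only fail when
# the major versions differ.
import re
import subprocess
import warnings

import torch


def installed_cuda_version():
    """Return (major, minor) of the nvcc found on PATH."""
    out = subprocess.check_output(["nvcc", "--version"]).decode()
    m = re.search(r"release (\d+)\.(\d+)", out)
    return int(m.group(1)), int(m.group(2))


def check_cuda_compat():
    torch_major, torch_minor = map(int, torch.version.cuda.split(".")[:2])
    sys_major, sys_minor = installed_cuda_version()
    if sys_major != torch_major:
        raise RuntimeError(
            f"Installed CUDA {sys_major}.{sys_minor} is incompatible with the "
            f"CUDA {torch_major}.{torch_minor} that torch was built with")
    if sys_minor != torch_minor:
        warnings.warn(
            f"CUDA minor-version mismatch (torch: {torch_major}.{torch_minor}, "
            f"nvcc: {sys_major}.{sys_minor}); building anyway since the APIs "
            f"are compatible")
```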

I verified that I was able to build all options with:

```
DS_BUILD_OPS=1 pip install deepspeed -v .
```
```
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0.dev20201202+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.7+7a75f8b, 7a75f8b, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.0
```

Otherwise with rtx-3090 I was getting:
`RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8`

jeffra (Collaborator) commented Dec 3, 2020

Thanks for this @stas00. Would you also be able to run our unit tests in your environment? I don't readily have access to a cuda 11.1 machine (the most up-to-date I have is 11.0), and I don't have access to rtx-3090s either.

```
pip install -r requirements/requirements-dev.txt
pytest --forked tests/unit/
```

stas00 (Collaborator, Author) commented Dec 3, 2020

I tried, but I'm getting lots of errors:

```
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs1_outputs1 PASSED                                                                            [  0%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs1[mask0] PASSED                                                                     [  0%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs1[mask1] PASSED                                                                     [  1%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs2[mask0] PASSED                                                                     [  1%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs2[mask1] PASSED                                                                     [  1%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs3[mask0] PASSED                                                                     [  2%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs3[mask1] PASSED                                                                     [  2%]
tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer FAILED                                                                                [  3%]
tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer FAILED                                                                                  [  3%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[1-False] FAILED                                                                          [  3%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[2-False] FAILED                                                                          [  4%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[2-True] PASSED                                                                           [  4%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[1-False] FAILED                                                                       [  4%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[2-False] FAILED                                                                       [  5%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[2-True] PASSED                                                                        [  5%]
```

I aborted since it was too slow and it was already clear that something isn't right - but I have never run this test suite before, so perhaps the failures are unrelated. Your CI seems to run all of these fine.

Here is the error log so far:

```
=========================================================================== FAILURES ============================================================================
_______________________________________________________________ test_checkpoint_unfused_optimizer _______________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 16:45:19,843] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+845921b, git-hash=845921b, git-branch=master
[2020-12-02 16:45:19,844] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:45:19,844] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:45:19,869] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 16:45:19,869] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-12-02 16:45:19,869] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 16:45:19,869] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 16:45:19,869] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
[2020-12-02 16:45:19,871] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 16:45:19,875] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 65536.0, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2.0, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'step': 1, 'exp_avg': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), 'exp_avg_sq': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')}, 1: {'step': 1, 'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}}, 'param_groups': [{'lr': 0.00015, 'bias_correction': True, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0.0, 'max_grad_norm': 0.0, 'max_coeff': 10.0, 'min_coeff': 0.01, 'params': [0, 1]}]}, 'fp32_groups': [[tensor([[ 1.8066e-01, -4.1809e-02, -2.1033e-01,  2.9492e-01,  9.6497e-02,
          2.1765e-01, -4.2755e-02, -2.2693e-01, -2.3230e-01, -1.7126e-01],
        [ 1.3281e-01,  3.3173e-02, -1.5320e-01, -2.7145e-02,  2.9712e-01,
          3.0225e-01,  8.7708e-02, -9.4299e-02,  9.7412e-02,  2.9346e-01],
        [ 8.5083e-02, -1.7993e-01, -4.7699e-02, -1.3220e-01,  1.9739e-01,
          1.8347e-01,  3.1030e-01, -2.9980e-01, -1.6663e-01, -1.7114e-01],
        [ 1.2341e-01, -3.0151e-01, -1.5161e-01,  2.8275e-02,  7.4280e-02,
          1.2817e-01,  2.7759e-01,  6.1951e-02, -1.4856e-01, -2.5635e-02],
        [ 1.6475e-04, -2.2827e-01, -4.8218e-02, -4.7272e-02,  2.2180e-01,
          2.2009e-01,  1.4600e-01, -2.6147e-01,  1.8787e-01,  1.6760e-01],
        [ 1.1725e-01,  1.4636e-01,  2.0190e-01, -1.5845e-01, -4.1870e-02,
         -2.0178e-01, -3.0640e-01,  2.5098e-01,  3.2349e-03,  2.1497e-01],
        [-2.2705e-01,  2.4646e-01, -5.7312e-02,  1.8726e-01, -2.4475e-01,
          2.3169e-01,  1.2122e-01,  2.0642e-01,  1.3562e-01,  5.7495e-02],
        [-1.3403e-01, -1.2054e-01, -2.9395e-01,  2.5049e-01,  1.6028e-01,
          1.5732e-02, -2.5415e-01, -7.2388e-02, -2.4878e-01, -1.1554e-01],
        [ 1.8079e-01, -2.3523e-01,  7.6172e-02, -2.5464e-01, -1.9128e-01,
         -1.2091e-01, -2.0068e-01,  2.5293e-01, -1.0040e-01, -1.0486e-01],
        [ 1.2878e-01,  1.7493e-01, -1.3281e-01,  6.5552e-02, -2.8687e-01,
         -2.7173e-01, -7.0557e-02, -1.6553e-01, -2.8809e-01, -3.3783e-02]],
       device='cuda:0', requires_grad=True), tensor([-0.0440,  0.0008, -0.0169, -0.3091, -0.2874, -0.1752, -0.1694,  0.2095,
         0.2372,  0.1008], device='cuda:0', requires_grad=True)]]}
[2020-12-02 16:45:19,876] [INFO] [engine.py:456:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = OneCycle
[2020-12-02 16:45:19,876] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.OneCycle object at 0x7fb79d0bebb0>
[2020-12-02 16:45:19,876] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.85, 0.99)]
[2020-12-02 16:45:19,876] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fb79d0bef40>
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   gradient_clipping ............ 1.0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   optimizer_name ............... lamb
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015}
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   scheduler_name ............... OneCycle
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   scheduler_params ............. {'cycle_first_step_size': 1000, 'cycle_first_stair_count': 500, 'cycle_second_step_size': 1000, 'cycle_second_stair_count': 500, 'decay_step_size': 1000, 'cycle_min_lr': 0.0001, 'cycle_max_lr': 0.001, 'decay_lr_rate': 0.001, 'cycle_min_mom': 0.85, 'cycle_max_mom': 0.99, 'decay_mom_rate': 0.0}
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fb79d0bef10>
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_enabled ................. False
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_optimization_stage ...... 0
[2020-12-02 16:45:19,877] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "lr":0.00015
        },
        "type":"Lamb"
    },
    "scheduler":{
        "params":{
            "cycle_first_stair_count":500,
            "cycle_first_step_size":1000,
            "cycle_max_lr":0.001,
            "cycle_max_mom":0.99,
            "cycle_min_lr":0.0001,
            "cycle_min_mom":0.85,
            "cycle_second_stair_count":500,
            "cycle_second_step_size":1000,
            "decay_lr_rate":0.001,
            "decay_mom_rate":0.0,
            "decay_step_size":1000
        },
        "type":"OneCycle"
    },
    "steps_per_print":1,
    "train_batch_size":2
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
THCudaCheck FAIL file=csrc/lamb/fused_lamb_cuda_kernel.cu line=465 error=209 : no kernel image is available for execution on the device
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 223, in _test_checkpoint_unfused_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 624, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 702, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/fp16/unfused_optimizer.py", line 101, in __init__
    self.initialize_optimizer_states()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/fp16/unfused_optimizer.py", line 368, in initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/lamb/fused_lamb.py", line 168, in step
    lamb_coeff = self.fused_lamb_cuda.lamb(p.data,
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at csrc/lamb/fused_lamb_cuda_kernel.cu:465
________________________________________________________________ test_checkpoint_fused_optimizer ________________________________________________________________
Worker 0 exited with code 1
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Connection reset by peer
____________________________________________________________ test_checkpoint_zero_optimizer[1-False] ____________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 16:47:22,915] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+845921b, git-hash=845921b, git-branch=master
[2020-12-02 16:47:22,917] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:47:22,917] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:47:22,939] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:47:22,939] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-02 16:47:22,939] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:47:22,939] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    weight_decay: 3e-07
)
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:47:22,939] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:47:22,939] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:47:22,939] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 500000000
[2020-12-02 16:47:22,939] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 110, max elements per com: 500000000
[2020-12-02 16:47:22,940] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 55, padding: 0
[2020-12-02 16:47:22,940] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 110 + 0 = 110
[2020-12-02 16:47:22,940] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2020-12-02 16:47:22,940] [INFO] [stage1.py:368:get_data_parallel_sub_partitions]        total_num_elements=110
[2020-12-02 16:47:22,940] [INFO] [stage1.py:369:get_data_parallel_sub_partitions]        world_size=2
[2020-12-02 16:47:22,940] [INFO] [stage1.py:370:get_data_parallel_sub_partitions]        max_elements_per_comm=110
[2020-12-02 16:47:22,940] [INFO] [stage1.py:371:get_data_parallel_sub_partitions]        sub_partition_size=55
[2020-12-02 16:47:22,940] [INFO] [stage1.py:372:get_data_parallel_sub_partitions]        num_sub_partitions=2
[2020-12-02 16:47:22,940] [INFO] [stage1.py:373:get_data_parallel_sub_partitions]        num_comm_intervals=1
[2020-12-02 16:47:22,940] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
[2020-12-02 16:47:22,942] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage1.FP16_DeepSpeedZeroOptimizer_Stage1 object at 0x7fb79d0bed30>
[2020-12-02 16:47:22,944] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'loss_scaler': <deepspeed.runtime.fp16.loss_scaler.DynamicLossScaler object at 0x7fb79d0bee50>, 'dynamic_loss_scale': True, 'overflow': False, 'base_optimizer_state': [[{'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}]], 'zero_stage': 1, 'partition_count': 2, 'num_comm_intervals_per_group': [1], 'local_sub_partitions_of_fp32_groups': [[tensor([ 1.8066e-01, -4.1809e-02, -2.1033e-01,  2.9492e-01,  9.6497e-02,
         2.1765e-01, -4.2755e-02, -2.2693e-01, -2.3230e-01, -1.7126e-01,
         1.3281e-01,  3.3173e-02, -1.5320e-01, -2.7145e-02,  2.9712e-01,
         3.0225e-01,  8.7708e-02, -9.4299e-02,  9.7412e-02,  2.9346e-01,
         8.5083e-02, -1.7993e-01, -4.7699e-02, -1.3220e-01,  1.9739e-01,
         1.8347e-01,  3.1030e-01, -2.9980e-01, -1.6663e-01, -1.7114e-01,
         1.2341e-01, -3.0151e-01, -1.5161e-01,  2.8275e-02,  7.4280e-02,
         1.2817e-01,  2.7759e-01,  6.1951e-02, -1.4856e-01, -2.5635e-02,
         1.6475e-04, -2.2827e-01, -4.8218e-02, -4.7272e-02,  2.2180e-01,
         2.2009e-01,  1.4600e-01, -2.6147e-01,  1.8787e-01,  1.6760e-01,
         1.1725e-01,  1.4636e-01,  2.0190e-01, -1.5845e-01, -4.1870e-02],
       device='cuda:0', grad_fn=<SliceBackward>)]]}
[2020-12-02 16:47:22,944] [INFO] [engine.py:461:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-12-02 16:47:22,944] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2020-12-02 16:47:22,944] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.00015], mom=[[0.8, 0.999]]
[2020-12-02 16:47:22,944] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fb79d0bec10>
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   gradient_clipping ............ 0.0
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   optimizer_name ............... adam
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'adam_w_mode': True}
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   scheduler_name ............... None
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   scheduler_params ............. None
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fb79d0bed60>
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_enabled ................. True
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_optimization_stage ...... 1
[2020-12-02 16:47:22,945] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":0.00015,
            "weight_decay":3e-07
        },
        "type":"Adam"
    },
    "steps_per_print":1,
    "train_batch_size":2,
    "zero_optimization":{
        "cpu_offload":false,
        "stage":1
    }
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 327, in _test_checkpoint_zero_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 609, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 719, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage1(
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 303, in __init__
    self._initialize_optimizer_states()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 313, in _initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/fused_adam.py", line 167, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/multi_tensor_apply.py", line 15, in __call__
    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: no kernel image is available for execution on the device
____________________________________________________________ test_checkpoint_zero_optimizer[2-False] ____________________________________________________________
Worker 0 exited with code 1
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Connection reset by peer
__________________________________________________________ test_checkpoint_zero_no_optimizer[1-False] ___________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 16:49:40,493] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+845921b, git-hash=845921b, git-branch=master
[2020-12-02 16:49:40,494] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:49:40,494] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:49:40,518] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:49:40,518] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-02 16:49:40,519] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:49:40,519] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    weight_decay: 3e-07
)
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:49:40,519] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:49:40,519] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 500000000
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 110, max elements per com: 500000000
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 55, padding: 0
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 110 + 0 = 110
[2020-12-02 16:49:40,519] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2020-12-02 16:49:40,519] [INFO] [stage1.py:368:get_data_parallel_sub_partitions]        total_num_elements=110
[2020-12-02 16:49:40,520] [INFO] [stage1.py:369:get_data_parallel_sub_partitions]        world_size=2
[2020-12-02 16:49:40,520] [INFO] [stage1.py:370:get_data_parallel_sub_partitions]        max_elements_per_comm=110
[2020-12-02 16:49:40,520] [INFO] [stage1.py:371:get_data_parallel_sub_partitions]        sub_partition_size=55
[2020-12-02 16:49:40,520] [INFO] [stage1.py:372:get_data_parallel_sub_partitions]        num_sub_partitions=2
[2020-12-02 16:49:40,520] [INFO] [stage1.py:373:get_data_parallel_sub_partitions]        num_comm_intervals=1
[2020-12-02 16:49:40,520] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
[2020-12-02 16:49:40,521] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage1.FP16_DeepSpeedZeroOptimizer_Stage1 object at 0x7fb79d0bef40>
[2020-12-02 16:49:40,523] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'loss_scaler': <deepspeed.runtime.fp16.loss_scaler.DynamicLossScaler object at 0x7fb79d0befa0>, 'dynamic_loss_scale': True, 'overflow': False, 'base_optimizer_state': [[{'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}]], 'zero_stage': 1, 'partition_count': 2, 'num_comm_intervals_per_group': [1], 'local_sub_partitions_of_fp32_groups': [[tensor([ 1.8066e-01, -4.1809e-02, -2.1033e-01,  2.9492e-01,  9.6497e-02,
         2.1765e-01, -4.2755e-02, -2.2693e-01, -2.3230e-01, -1.7126e-01,
         1.3281e-01,  3.3173e-02, -1.5320e-01, -2.7145e-02,  2.9712e-01,
         3.0225e-01,  8.7708e-02, -9.4299e-02,  9.7412e-02,  2.9346e-01,
         8.5083e-02, -1.7993e-01, -4.7699e-02, -1.3220e-01,  1.9739e-01,
         1.8347e-01,  3.1030e-01, -2.9980e-01, -1.6663e-01, -1.7114e-01,
         1.2341e-01, -3.0151e-01, -1.5161e-01,  2.8275e-02,  7.4280e-02,
         1.2817e-01,  2.7759e-01,  6.1951e-02, -1.4856e-01, -2.5635e-02,
         1.6475e-04, -2.2827e-01, -4.8218e-02, -4.7272e-02,  2.2180e-01,
         2.2009e-01,  1.4600e-01, -2.6147e-01,  1.8787e-01,  1.6760e-01,
         1.1725e-01,  1.4636e-01,  2.0190e-01, -1.5845e-01, -4.1870e-02],
       device='cuda:0', grad_fn=<SliceBackward>)]]}
[2020-12-02 16:49:40,523] [INFO] [engine.py:461:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-12-02 16:49:40,523] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2020-12-02 16:49:40,523] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.00015], mom=[[0.8, 0.999]]
[2020-12-02 16:49:40,523] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 16:49:40,523] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fb79d0bedc0>
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   gradient_clipping ............ 0.0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   optimizer_name ............... adam
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'adam_w_mode': True}
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   scheduler_name ............... None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   scheduler_params ............. None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fb79d0bedf0>
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_enabled ................. True
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_optimization_stage ...... 1
[2020-12-02 16:49:40,524] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":0.00015,
            "weight_decay":3e-07
        },
        "type":"Adam"
    },
    "steps_per_print":1,
    "train_batch_size":2,
    "zero_optimization":{
        "cpu_offload":false,
        "stage":1
    }
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 383, in _test_checkpoint_zero_no_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 609, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 719, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage1(
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 303, in __init__
    self._initialize_optimizer_states()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 313, in _initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/fused_adam.py", line 167, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/multi_tensor_apply.py", line 15, in __call__
    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: no kernel image is available for execution on the device
__________________________________________________________ test_checkpoint_zero_no_optimizer[2-False] ___________________________________________________________
Worker 0 exited with code 1
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Connection reset by peer
==================================================================== short test summary info ====================================================================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[1-False]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[2-False]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[1-False]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[2-False]
```
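
For what it's worth, here is a quick sanity-check sketch for the "no kernel image is available" failures above (not part of this PR; it only inspects torch's own build, not the pre-compiled DeepSpeed extensions, which would need e.g. cuobjdump on the .so files):

```python
# Sanity-check sketch: compare the local GPU's compute capability with the arch
# list torch itself was built for. The same "does the binary contain sm_80
# kernels?" question applies to the separately compiled DeepSpeed ops.
import torch

major, minor = torch.cuda.get_device_capability(0)  # (8, 0) on an rtx-3090
wanted = f"sm_{major}{minor}"
built_for = torch.cuda.get_arch_list()  # available in recent torch versions
print(f"device needs   : {wanted}")
print(f"torch built for: {built_for}")
if wanted not in built_for:
    print("no matching kernels -> expect 'no kernel image is available' errors")
```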

jeffra (Collaborator) commented Dec 3, 2020

Ah yeah, I think this makes sense. PR #572 should fix this issue. Essentially, when pre-compiling our ops we weren't passing the compute capability flags for 8.0, which are needed to build the cuda/c++ code for the right hardware capabilities.

I think the unit tests should work if you instead re-install and use JIT-only compilation. JIT should pick up whatever compute capability is being used at runtime.
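
Roughly, the difference looks like the sketch below (illustrative only, not the code in #572; `gencode_flags()` is a hypothetical helper): an ahead-of-time build must be told every target architecture via nvcc `-gencode` flags, so 8.0 has to be listed explicitly for the rtx-3090, while a JIT build can query the GPU that is actually present:

```python
# Illustrative sketch of how -gencode flags could be derived; gencode_flags()
# is a hypothetical helper, not DeepSpeed's builder API.
import torch


def gencode_flags(cross_compile_archs=None):
    if cross_compile_archs is None:
        # JIT-style: ask the GPU that is actually present, e.g. (8, 0) on an rtx-3090.
        major, minor = torch.cuda.get_device_capability()
        archs = [f"{major}.{minor}"]
    else:
        # Ahead-of-time: the build machine may have a different (or no) GPU, so
        # the target archs must be listed explicitly. If 8.0 is missing here,
        # Ampere cards later fail with "no kernel image is available".
        archs = cross_compile_archs
    return [
        f"-gencode=arch=compute_{a.replace('.', '')},code=sm_{a.replace('.', '')}"
        for a in archs
    ]


print(gencode_flags(["6.0", "7.0"]))  # pre-compiled build without Ampere support
print(gencode_flags())                # JIT-style: picks up the local GPU
```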

jeffra (Collaborator) commented Dec 3, 2020

Merged #572 - it looks like this PR now has a merge conflict, though. Can you give it a try on your end after resolving the merge conflict with your change? It should be small, I think.

stas00 (Collaborator, Author) commented Dec 3, 2020

I built the binaries on your branch and tried one test - no change:

pytest -v --forked tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer           
====================================================================== test session starts ======================================================================
platform linux -- Python 3.8.5, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /home/stas/anaconda3/envs/main-38/bin/python
cachedir: .pytest_cache
rootdir: /mnt/nvme1/code/github/00optimize/deepspeed
plugins: hydra-core-1.0.3, forked-1.3.0, xdist-2.1.0, instafail-0.4.2, ipynb-1.1.1.dev0
collected 1 item                                                                                                                                                

tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer FAILED                                                                                [100%]

=========================================================================== FAILURES ============================================================================
_______________________________________________________________ test_checkpoint_unfused_optimizer _______________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 17:29:53,782] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+59ff206, git-hash=59ff206, git-branch=jeffra/cc80
[2020-12-02 17:29:53,783] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 17:29:53,783] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 17:29:53,808] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-12-02 17:29:53,808] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 17:29:53,808] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 17:29:53,808] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 17:29:53,808] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
[2020-12-02 17:29:53,810] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 17:29:53,814] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 65536.0, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2.0, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'step': 1, 'exp_avg': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), 'exp_avg_sq': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')}, 1: {'step': 1, 'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}}, 'param_groups': [{'lr': 0.00015, 'bias_correction': True, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0.0, 'max_grad_norm': 0.0, 'max_coeff': 10.0, 'min_coeff': 0.01, 'params': [0, 1]}]}, 'fp32_groups': [[tensor([[ 0.0839,  0.0599, -0.0598,  0.2319,  0.0323, -0.1559,  0.0842,  0.2578,
         -0.2104, -0.1906],
        [ 0.1914, -0.2820, -0.0882, -0.0095,  0.1659,  0.1089,  0.1512, -0.0975,
         -0.2190, -0.1095],
        [-0.2869,  0.1709,  0.2908,  0.0270, -0.0999,  0.1266,  0.1981,  0.1562,
         -0.2411, -0.2698],
        [-0.1201,  0.0515,  0.1439, -0.0302, -0.0568,  0.0694,  0.1666, -0.1321,
         -0.0742,  0.1001],
        [ 0.1855, -0.0106,  0.0769, -0.0176,  0.2812,  0.2426, -0.2013,  0.1372,
          0.2306, -0.0094],
        [ 0.1185,  0.1104, -0.2942, -0.0188, -0.2578,  0.0153,  0.3069, -0.0435,
          0.0638, -0.2389],
        [ 0.0477,  0.0047,  0.1968,  0.1646,  0.1123,  0.2030, -0.1915, -0.0587,
         -0.0530, -0.1991],
        [ 0.0147,  0.2159,  0.2598,  0.2520, -0.2839,  0.1892, -0.1602,  0.2883,
         -0.1327,  0.2002],
        [ 0.1338,  0.0575, -0.2703,  0.1372, -0.2605, -0.1187, -0.1914,  0.2290,
         -0.1929,  0.1995],
        [ 0.0037, -0.2291, -0.0471,  0.1107, -0.0403, -0.1243,  0.1771, -0.1327,
         -0.1637, -0.0156]], device='cuda:0', requires_grad=True), tensor([ 0.0542,  0.1185, -0.2170, -0.0490,  0.2025, -0.2917, -0.1114,  0.1796,
        -0.0192, -0.0256], device='cuda:0', requires_grad=True)]]}
[2020-12-02 17:29:53,814] [INFO] [engine.py:456:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = OneCycle
[2020-12-02 17:29:53,814] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.OneCycle object at 0x7fbecd579fa0>
[2020-12-02 17:29:53,814] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.85, 0.99)]
[2020-12-02 17:29:53,814] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fbecd515bb0>
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   gradient_clipping ............ 1.0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   optimizer_name ............... lamb
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015}
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   scheduler_name ............... OneCycle
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   scheduler_params ............. {'cycle_first_step_size': 1000, 'cycle_first_stair_count': 500, 'cycle_second_step_size': 1000, 'cycle_second_stair_count': 500, 'decay_step_size': 1000, 'cycle_min_lr': 0.0001, 'cycle_max_lr': 0.001, 'decay_lr_rate': 0.001, 'cycle_min_mom': 0.85, 'cycle_max_mom': 0.99, 'decay_mom_rate': 0.0}
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fbecd515c40>
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_enabled ................. False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_optimization_stage ...... 0
[2020-12-02 17:29:53,815] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "lr":0.00015
        },
        "type":"Lamb"
    },
    "scheduler":{
        "params":{
            "cycle_first_stair_count":500,
            "cycle_first_step_size":1000,
            "cycle_max_lr":0.001,
            "cycle_max_mom":0.99,
            "cycle_min_lr":0.0001,
            "cycle_min_mom":0.85,
            "cycle_second_stair_count":500,
            "cycle_second_step_size":1000,
            "decay_lr_rate":0.001,
            "decay_mom_rate":0.0,
            "decay_step_size":1000
        },
        "type":"OneCycle"
    },
    "steps_per_print":1,
    "train_batch_size":2
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
THCudaCheck FAIL file=csrc/lamb/fused_lamb_cuda_kernel.cu line=465 error=209 : no kernel image is available for execution on the device
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 223, in _test_checkpoint_unfused_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 624, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 702, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 101, in __init__
    self.initialize_optimizer_states()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 368, in initialize_optimizer_states
    self.optimizer.step()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/lamb/fused_lamb.py", line 168, in step
    lamb_coeff = self.fused_lamb_cuda.lamb(p.data,
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at csrc/lamb/fused_lamb_cuda_kernel.cu:465
==================================================================== short test summary info ====================================================================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer

Will try JIT next.
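(A quick check, as a hedged sketch: compare the card's compute capability against the arch list the installed torch build reports. The deepspeed ops are compiled separately, so this is only a first sanity check; torch.cuda.get_arch_list() is assumed to be available in this nightly.)

```
# prints e.g. (8, 0) for the rtx-3090, and the compute archs the torch build itself supports
python -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"
```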

@stas00
Collaborator Author

stas00 commented Dec 3, 2020

JIT fails too. Also, weirdly, it skipped over the rtx-3090 card (0) and ran the test on the other, older card (1). (Same test as above.)

There is a very long output, ending with:

[2020-12-02 17:39:30,777] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 17:39:30,777] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
Emitting ninja build file /home/stas/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/TH -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/THC -isystem /home/stas/anaconda3/envs/main-38/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 8.172157049179077 seconds
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
THCudaCheck FAIL file=/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu line=465 error=209 : no kernel image is available for execution on the device
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 223, in _test_checkpoint_unfused_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 624, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 702, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 101, in __init__
    self.initialize_optimizer_states()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 368, in initialize_optimizer_states
    self.optimizer.step()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/lamb/fused_lamb.py", line 168, in step
    lamb_coeff = self.fused_lamb_cuda.lamb(p.data,
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu:465
==================================================================== short test summary info ====================================================================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0.dev20201202+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.7+b88a741, b88a741, patch-1
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.0

@jeffra
Collaborator

jeffra commented Dec 3, 2020

  1. Regarding the JIT issue: what is CUDA_VISIBLE_DEVICES set to during runtime? Update: actually, this is interesting. Our unit tests run across multiple GPUs; we typically run them on nodes with up to 4 GPUs. I am not sure we currently handle it gracefully when the GPU count a test expects is not available (cc/ @ShadenSmith).

  2. Hopefully the new update fixes the issue with pre-installing the ops and running the unit tests?

@stas00
Collaborator Author

stas00 commented Dec 3, 2020

  • Regarding the JIT issue: what is CUDA_VISIBLE_DEVICES set to during runtime?

It wasn't set - both cards should have been visible.
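For completeness, making the device visibility explicit would look like this (the indices are an assumption about this box's ordering):

```
# expose only the rtx-3090, or both cards in a fixed order, to the run
CUDA_VISIBLE_DEVICES=0 ds_report
CUDA_VISIBLE_DEVICES=0,1 pytest -v --forked tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
```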

  • Hopefully the new update fixes the issue with pre-installing the ops and running the unit tests?

As reported in my two comments above yours - it didn't help. Please let me know if you need any other setup info to diagnose these.

@jeffra
Collaborator

jeffra commented Dec 3, 2020

I just created a new issue for us to fix our unit tests so they are runnable on fewer than 4 GPUs. Unfortunately, right now the best way to check whether everything is set up correctly is a combination of ds_report and running an example model on your hardware (a sketch follows below). Also, I don't believe we have done much testing on systems with heterogeneous GPU types, since we don't have access to any systems like that. Are you intending to only run with 1 GPU type at a time or did you want to use both?
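As a rough sketch (the script and config names below are placeholders, not files shipped in this repo), an end-to-end smoke test with the launcher would be:

```
# hypothetical example run on the 2-gpu box; train.py and ds_config.json are placeholders
deepspeed --num_gpus=2 train.py --deepspeed --deepspeed_config ds_config.json
```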

The original PR here seems fine, so I'll merge it now. If you run into any problems, though, don't hesitate to open an issue.

@jeffra jeffra merged commit ff58fa7 into microsoft:master Dec 3, 2020
@stas00 stas00 deleted the patch-1 branch December 3, 2020 06:02
@stas00
Collaborator Author

stas00 commented Dec 3, 2020

Are you intending to only run with 1 GPU type at a time or did you want to use both?

I have 2 GPUs in my current box and am building another box with 2 more, older GPUs, so I'm hoping to be able to test complex multi-node setups.

I'm just impatiently waiting for cuda-11.2 to be released so that pytorch and friends will fully support the rtx-3090. It has been a month of pain since I got the card...

@g-karthik

g-karthik commented Dec 24, 2020

@stas00 @jeffra I came across this PR because I'm having a somewhat related issue. I was recently trying to scale my training jobs up to 30+ nodes, and I found that the deepspeed.initialize() call fails with the following:

Traceback (most recent call last):
    model_engine, optimizer, _, _ = deepspeed.initialize(args, model, model_parameters=model.parameters())
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/__init__.py", line 118, in initialize
    config_params=config_params)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/engine.py", line 148, in __init__
    dist.init_process_group(backend=self.dist_backend)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/rendezvous.py", line 168, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection reset by peer

It didn't happen when I was training with fewer than 30 nodes. Any ideas on what's going on?

Also, this stack trace was printed just 3 times in my log file when training with 30 nodes (8 GPUs each), i.e. 240 processes, which suggests the reset happened on only a few processes, not all of them. Either way, the actual training never started because of this RuntimeError on those processes.

If this is unrelated to this PR (I only landed here because I searched for "Connection reset by peer"), I can create a new issue.

@stas00
Collaborator Author

stas00 commented Dec 24, 2020

The PR itself was about something entirely different; it's the tests that failed to run on my 2-gpu single-node setup, and that failure was unrelated to the PR.

The test failure does look similar to your issue: in my report, process 1 reported a failure, and in response process 2 reset the connection. Is it possible that some process failed in your setup but its error was never surfaced? I saw at least one situation where deepspeed just exit(0)-ed without any error message (when I had misconfigured the config file).
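If it helps, a per-process sanity check of the env:// rendezvous variables that the TCPStore in your trace is built from (assuming your launcher exports them) is as simple as:

```
# every rank should agree on MASTER_ADDR/MASTER_PORT/WORLD_SIZE and report a unique RANK
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT RANK=$RANK WORLD_SIZE=$WORLD_SIZE"
```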

But a separate issue is probably the best way to proceed; you can link to the failure I reported here as related.

Successfully merging this pull request may close these issues.

Cuda 11 cannot be supported