
[build] build against installed cuda-11.1 while torch built w/ cuda-11.0 #570

Merged: 5 commits merged on Dec 3, 2020

Conversation

stas00 (Collaborator) commented Dec 2, 2020

I learned this from nvidia apex: building against the installed cuda-11.1 works even though torch was built with cuda-11.0, as the API is similar (identical?). Note that tensorflow requires cuda-11.1 to work with rtx-30* cards. So while I do have both 11.0 and 11.1 installed, the builder can't find 11.0 automatically.

We can probably remove this once cuda-11.2 comes out and pytorch fully supports Ampere - until then pytorch can't be built with cuda-11.1.
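
To make the idea concrete, here is a minimal sketch of this kind of relaxed check - not the actual DeepSpeed builder code, and `installed_cuda_version()` / `check_cuda_compat()` are hypothetical names - where only a major-version mismatch between torch's CUDA and the system nvcc is treated as fatal, while a minor mismatch (11.0 vs 11.1) just warns:

```python
# Hypothetical sketch, not DeepSpeed's actual version check: tolerate a CUDA
# minor-version mismatch (torch built with 11.0, nvcc 11.1) and only fail when
# the major versions differ.
import re
import subprocess
import warnings

import torch


def installed_cuda_version():
    """Return (major, minor) of the nvcc found on PATH."""
    out = subprocess.check_output(["nvcc", "--version"]).decode()
    m = re.search(r"release (\d+)\.(\d+)", out)
    return int(m.group(1)), int(m.group(2))


def check_cuda_compat():
    torch_major, torch_minor = map(int, torch.version.cuda.split(".")[:2])
    sys_major, sys_minor = installed_cuda_version()
    if sys_major != torch_major:
        raise RuntimeError(
            f"Installed CUDA {sys_major}.{sys_minor} is incompatible with the "
            f"CUDA {torch_major}.{torch_minor} that torch was built with")
    if sys_minor != torch_minor:
        warnings.warn(
            f"CUDA minor-version mismatch (torch: {torch_major}.{torch_minor}, "
            f"nvcc: {sys_major}.{sys_minor}); building anyway since the APIs "
            f"are compatible")
```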

I verified that I was able to build all options with:

```
DS_BUILD_OPS=1 pip install deepspeed -v .
```
```
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0.dev20201202+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.7+7a75f8b, 7a75f8b, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.0
```

Otherwise with rtx-3090 I was getting:
`RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8`

jeffra (Collaborator) commented Dec 3, 2020

Thanks for this @stas00. Would you also be able to run our unit tests in your environment? I don't readily have access to a cuda 11.1 machine (the most up-to-date I have is 11.0), and I don't have access to rtx-3090s either.

```
pip install -r requirements/requirements-dev.txt
pytest --forked tests/unit/
```

stas00 (Collaborator, Author) commented Dec 3, 2020

I tried, but I'm getting lots of errors:

```
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs1_outputs1 PASSED                                                                            [  0%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs1[mask0] PASSED                                                                     [  0%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs1[mask1] PASSED                                                                     [  1%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs2[mask0] PASSED                                                                     [  1%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs2[mask1] PASSED                                                                     [  1%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs3[mask0] PASSED                                                                     [  2%]
tests/unit/test_activation_checkpointing.py::test_ckpt_inputs2_outputs3[mask1] PASSED                                                                     [  2%]
tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer FAILED                                                                                [  3%]
tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer FAILED                                                                                  [  3%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[1-False] FAILED                                                                          [  3%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[2-False] FAILED                                                                          [  4%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[2-True] PASSED                                                                           [  4%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[1-False] FAILED                                                                       [  4%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[2-False] FAILED                                                                       [  5%]
tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[2-True] PASSED                                                                        [  5%]
```

I aborted since it was too slow and it was already clear that something isn't right - but I have never run this test suite before, so perhaps the failures are unrelated. Your CI seems to run all of these fine.

Here is the error log so far:

```
=========================================================================== FAILURES ============================================================================
_______________________________________________________________ test_checkpoint_unfused_optimizer _______________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 16:45:19,843] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+845921b, git-hash=845921b, git-branch=master
[2020-12-02 16:45:19,844] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:45:19,844] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:45:19,869] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 16:45:19,869] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-12-02 16:45:19,869] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 16:45:19,869] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 16:45:19,869] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
[2020-12-02 16:45:19,871] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 16:45:19,875] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 65536.0, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2.0, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'step': 1, 'exp_avg': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), 'exp_avg_sq': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')}, 1: {'step': 1, 'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}}, 'param_groups': [{'lr': 0.00015, 'bias_correction': True, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0.0, 'max_grad_norm': 0.0, 'max_coeff': 10.0, 'min_coeff': 0.01, 'params': [0, 1]}]}, 'fp32_groups': [[tensor([[ 1.8066e-01, -4.1809e-02, -2.1033e-01,  2.9492e-01,  9.6497e-02,
          2.1765e-01, -4.2755e-02, -2.2693e-01, -2.3230e-01, -1.7126e-01],
        [ 1.3281e-01,  3.3173e-02, -1.5320e-01, -2.7145e-02,  2.9712e-01,
          3.0225e-01,  8.7708e-02, -9.4299e-02,  9.7412e-02,  2.9346e-01],
        [ 8.5083e-02, -1.7993e-01, -4.7699e-02, -1.3220e-01,  1.9739e-01,
          1.8347e-01,  3.1030e-01, -2.9980e-01, -1.6663e-01, -1.7114e-01],
        [ 1.2341e-01, -3.0151e-01, -1.5161e-01,  2.8275e-02,  7.4280e-02,
          1.2817e-01,  2.7759e-01,  6.1951e-02, -1.4856e-01, -2.5635e-02],
        [ 1.6475e-04, -2.2827e-01, -4.8218e-02, -4.7272e-02,  2.2180e-01,
          2.2009e-01,  1.4600e-01, -2.6147e-01,  1.8787e-01,  1.6760e-01],
        [ 1.1725e-01,  1.4636e-01,  2.0190e-01, -1.5845e-01, -4.1870e-02,
         -2.0178e-01, -3.0640e-01,  2.5098e-01,  3.2349e-03,  2.1497e-01],
        [-2.2705e-01,  2.4646e-01, -5.7312e-02,  1.8726e-01, -2.4475e-01,
          2.3169e-01,  1.2122e-01,  2.0642e-01,  1.3562e-01,  5.7495e-02],
        [-1.3403e-01, -1.2054e-01, -2.9395e-01,  2.5049e-01,  1.6028e-01,
          1.5732e-02, -2.5415e-01, -7.2388e-02, -2.4878e-01, -1.1554e-01],
        [ 1.8079e-01, -2.3523e-01,  7.6172e-02, -2.5464e-01, -1.9128e-01,
         -1.2091e-01, -2.0068e-01,  2.5293e-01, -1.0040e-01, -1.0486e-01],
        [ 1.2878e-01,  1.7493e-01, -1.3281e-01,  6.5552e-02, -2.8687e-01,
         -2.7173e-01, -7.0557e-02, -1.6553e-01, -2.8809e-01, -3.3783e-02]],
       device='cuda:0', requires_grad=True), tensor([-0.0440,  0.0008, -0.0169, -0.3091, -0.2874, -0.1752, -0.1694,  0.2095,
         0.2372,  0.1008], device='cuda:0', requires_grad=True)]]}
[2020-12-02 16:45:19,876] [INFO] [engine.py:456:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = OneCycle
[2020-12-02 16:45:19,876] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.OneCycle object at 0x7fb79d0bebb0>
[2020-12-02 16:45:19,876] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.85, 0.99)]
[2020-12-02 16:45:19,876] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fb79d0bef40>
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   gradient_clipping ............ 1.0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   optimizer_name ............... lamb
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015}
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   scheduler_name ............... OneCycle
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   scheduler_params ............. {'cycle_first_step_size': 1000, 'cycle_first_stair_count': 500, 'cycle_second_step_size': 1000, 'cycle_second_stair_count': 500, 'decay_step_size': 1000, 'cycle_min_lr': 0.0001, 'cycle_max_lr': 0.001, 'decay_lr_rate': 0.001, 'cycle_min_mom': 0.85, 'cycle_max_mom': 0.99, 'decay_mom_rate': 0.0}
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 16:45:19,876] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fb79d0bef10>
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_enabled ................. False
[2020-12-02 16:45:19,877] [INFO] [config.py:648:print]   zero_optimization_stage ...... 0
[2020-12-02 16:45:19,877] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "lr":0.00015
        },
        "type":"Lamb"
    },
    "scheduler":{
        "params":{
            "cycle_first_stair_count":500,
            "cycle_first_step_size":1000,
            "cycle_max_lr":0.001,
            "cycle_max_mom":0.99,
            "cycle_min_lr":0.0001,
            "cycle_min_mom":0.85,
            "cycle_second_stair_count":500,
            "cycle_second_step_size":1000,
            "decay_lr_rate":0.001,
            "decay_mom_rate":0.0,
            "decay_step_size":1000
        },
        "type":"OneCycle"
    },
    "steps_per_print":1,
    "train_batch_size":2
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
THCudaCheck FAIL file=csrc/lamb/fused_lamb_cuda_kernel.cu line=465 error=209 : no kernel image is available for execution on the device
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 223, in _test_checkpoint_unfused_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 624, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 702, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/fp16/unfused_optimizer.py", line 101, in __init__
    self.initialize_optimizer_states()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/fp16/unfused_optimizer.py", line 368, in initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/lamb/fused_lamb.py", line 168, in step
    lamb_coeff = self.fused_lamb_cuda.lamb(p.data,
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at csrc/lamb/fused_lamb_cuda_kernel.cu:465
________________________________________________________________ test_checkpoint_fused_optimizer ________________________________________________________________
Worker 0 exited with code 1
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Connection reset by peer
____________________________________________________________ test_checkpoint_zero_optimizer[1-False] ____________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 16:47:22,915] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+845921b, git-hash=845921b, git-branch=master
[2020-12-02 16:47:22,917] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:47:22,917] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:47:22,939] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:47:22,939] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-02 16:47:22,939] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:47:22,939] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    weight_decay: 3e-07
)
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:47:22,939] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:47:22,939] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:47:22,939] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 500000000
[2020-12-02 16:47:22,939] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 110, max elements per com: 500000000
[2020-12-02 16:47:22,940] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 55, padding: 0
[2020-12-02 16:47:22,940] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 110 + 0 = 110
[2020-12-02 16:47:22,940] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2020-12-02 16:47:22,940] [INFO] [stage1.py:368:get_data_parallel_sub_partitions]        total_num_elements=110
[2020-12-02 16:47:22,940] [INFO] [stage1.py:369:get_data_parallel_sub_partitions]        world_size=2
[2020-12-02 16:47:22,940] [INFO] [stage1.py:370:get_data_parallel_sub_partitions]        max_elements_per_comm=110
[2020-12-02 16:47:22,940] [INFO] [stage1.py:371:get_data_parallel_sub_partitions]        sub_partition_size=55
[2020-12-02 16:47:22,940] [INFO] [stage1.py:372:get_data_parallel_sub_partitions]        num_sub_partitions=2
[2020-12-02 16:47:22,940] [INFO] [stage1.py:373:get_data_parallel_sub_partitions]        num_comm_intervals=1
[2020-12-02 16:47:22,940] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
[2020-12-02 16:47:22,942] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage1.FP16_DeepSpeedZeroOptimizer_Stage1 object at 0x7fb79d0bed30>
[2020-12-02 16:47:22,944] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'loss_scaler': <deepspeed.runtime.fp16.loss_scaler.DynamicLossScaler object at 0x7fb79d0bee50>, 'dynamic_loss_scale': True, 'overflow': False, 'base_optimizer_state': [[{'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}]], 'zero_stage': 1, 'partition_count': 2, 'num_comm_intervals_per_group': [1], 'local_sub_partitions_of_fp32_groups': [[tensor([ 1.8066e-01, -4.1809e-02, -2.1033e-01,  2.9492e-01,  9.6497e-02,
         2.1765e-01, -4.2755e-02, -2.2693e-01, -2.3230e-01, -1.7126e-01,
         1.3281e-01,  3.3173e-02, -1.5320e-01, -2.7145e-02,  2.9712e-01,
         3.0225e-01,  8.7708e-02, -9.4299e-02,  9.7412e-02,  2.9346e-01,
         8.5083e-02, -1.7993e-01, -4.7699e-02, -1.3220e-01,  1.9739e-01,
         1.8347e-01,  3.1030e-01, -2.9980e-01, -1.6663e-01, -1.7114e-01,
         1.2341e-01, -3.0151e-01, -1.5161e-01,  2.8275e-02,  7.4280e-02,
         1.2817e-01,  2.7759e-01,  6.1951e-02, -1.4856e-01, -2.5635e-02,
         1.6475e-04, -2.2827e-01, -4.8218e-02, -4.7272e-02,  2.2180e-01,
         2.2009e-01,  1.4600e-01, -2.6147e-01,  1.8787e-01,  1.6760e-01,
         1.1725e-01,  1.4636e-01,  2.0190e-01, -1.5845e-01, -4.1870e-02],
       device='cuda:0', grad_fn=<SliceBackward>)]]}
[2020-12-02 16:47:22,944] [INFO] [engine.py:461:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-12-02 16:47:22,944] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2020-12-02 16:47:22,944] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.00015], mom=[[0.8, 0.999]]
[2020-12-02 16:47:22,944] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fb79d0bec10>
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   gradient_clipping ............ 0.0
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 16:47:22,944] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   optimizer_name ............... adam
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'adam_w_mode': True}
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   scheduler_name ............... None
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   scheduler_params ............. None
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fb79d0bed60>
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_enabled ................. True
[2020-12-02 16:47:22,945] [INFO] [config.py:648:print]   zero_optimization_stage ...... 1
[2020-12-02 16:47:22,945] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":0.00015,
            "weight_decay":3e-07
        },
        "type":"Adam"
    },
    "steps_per_print":1,
    "train_batch_size":2,
    "zero_optimization":{
        "cpu_offload":false,
        "stage":1
    }
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 327, in _test_checkpoint_zero_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 609, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 719, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage1(
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 303, in __init__
    self._initialize_optimizer_states()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 313, in _initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/fused_adam.py", line 167, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/multi_tensor_apply.py", line 15, in __call__
    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: no kernel image is available for execution on the device
____________________________________________________________ test_checkpoint_zero_optimizer[2-False] ____________________________________________________________
Worker 0 exited with code 1
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Connection reset by peer
__________________________________________________________ test_checkpoint_zero_no_optimizer[1-False] ___________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 16:49:40,493] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+845921b, git-hash=845921b, git-branch=master
[2020-12-02 16:49:40,494] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 16:49:40,494] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:49:40,518] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:49:40,518] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-02 16:49:40,519] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:49:40,519] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    weight_decay: 3e-07
)
Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2020-12-02 16:49:40,519] [INFO] [engine.py:715:_configure_zero_optimizer] Creating fp16 ZeRO stage 1 optimizer
[2020-12-02 16:49:40,519] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 500000000
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 110, max elements per com: 500000000
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 55, padding: 0
[2020-12-02 16:49:40,519] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 110 + 0 = 110
[2020-12-02 16:49:40,519] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2020-12-02 16:49:40,519] [INFO] [stage1.py:368:get_data_parallel_sub_partitions]        total_num_elements=110
[2020-12-02 16:49:40,520] [INFO] [stage1.py:369:get_data_parallel_sub_partitions]        world_size=2
[2020-12-02 16:49:40,520] [INFO] [stage1.py:370:get_data_parallel_sub_partitions]        max_elements_per_comm=110
[2020-12-02 16:49:40,520] [INFO] [stage1.py:371:get_data_parallel_sub_partitions]        sub_partition_size=55
[2020-12-02 16:49:40,520] [INFO] [stage1.py:372:get_data_parallel_sub_partitions]        num_sub_partitions=2
[2020-12-02 16:49:40,520] [INFO] [stage1.py:373:get_data_parallel_sub_partitions]        num_comm_intervals=1
[2020-12-02 16:49:40,520] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
[2020-12-02 16:49:40,521] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage1.FP16_DeepSpeedZeroOptimizer_Stage1 object at 0x7fb79d0bef40>
[2020-12-02 16:49:40,523] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'loss_scaler': <deepspeed.runtime.fp16.loss_scaler.DynamicLossScaler object at 0x7fb79d0befa0>, 'dynamic_loss_scale': True, 'overflow': False, 'base_optimizer_state': [[{'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}]], 'zero_stage': 1, 'partition_count': 2, 'num_comm_intervals_per_group': [1], 'local_sub_partitions_of_fp32_groups': [[tensor([ 1.8066e-01, -4.1809e-02, -2.1033e-01,  2.9492e-01,  9.6497e-02,
         2.1765e-01, -4.2755e-02, -2.2693e-01, -2.3230e-01, -1.7126e-01,
         1.3281e-01,  3.3173e-02, -1.5320e-01, -2.7145e-02,  2.9712e-01,
         3.0225e-01,  8.7708e-02, -9.4299e-02,  9.7412e-02,  2.9346e-01,
         8.5083e-02, -1.7993e-01, -4.7699e-02, -1.3220e-01,  1.9739e-01,
         1.8347e-01,  3.1030e-01, -2.9980e-01, -1.6663e-01, -1.7114e-01,
         1.2341e-01, -3.0151e-01, -1.5161e-01,  2.8275e-02,  7.4280e-02,
         1.2817e-01,  2.7759e-01,  6.1951e-02, -1.4856e-01, -2.5635e-02,
         1.6475e-04, -2.2827e-01, -4.8218e-02, -4.7272e-02,  2.2180e-01,
         2.2009e-01,  1.4600e-01, -2.6147e-01,  1.8787e-01,  1.6760e-01,
         1.1725e-01,  1.4636e-01,  2.0190e-01, -1.5845e-01, -4.1870e-02],
       device='cuda:0', grad_fn=<SliceBackward>)]]}
[2020-12-02 16:49:40,523] [INFO] [engine.py:461:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-12-02 16:49:40,523] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2020-12-02 16:49:40,523] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.00015], mom=[[0.8, 0.999]]
[2020-12-02 16:49:40,523] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 16:49:40,523] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fb79d0bedc0>
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   gradient_clipping ............ 0.0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   optimizer_name ............... adam
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'adam_w_mode': True}
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   scheduler_name ............... None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   scheduler_params ............. None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fb79d0bedf0>
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_enabled ................. True
[2020-12-02 16:49:40,524] [INFO] [config.py:648:print]   zero_optimization_stage ...... 1
[2020-12-02 16:49:40,524] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":0.00015,
            "weight_decay":3e-07
        },
        "type":"Adam"
    },
    "steps_per_print":1,
    "train_batch_size":2,
    "zero_optimization":{
        "cpu_offload":false,
        "stage":1
    }
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 383, in _test_checkpoint_zero_no_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 609, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 719, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage1(
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 303, in __init__
    self._initialize_optimizer_states()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage1.py", line 313, in _initialize_optimizer_states
    self.optimizer.step()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/fused_adam.py", line 167, in step
    multi_tensor_applier(self.multi_tensor_adam,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/ops/adam/multi_tensor_apply.py", line 15, in __call__
    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: no kernel image is available for execution on the device
__________________________________________________________ test_checkpoint_zero_no_optimizer[2-False] ___________________________________________________________
Worker 0 exited with code 1
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 36, in dist_init
    dist.init_process_group(backend=backend,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 467, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 193, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Connection reset by peer
==================================================================== short test summary info ====================================================================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[1-False]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_optimizer[2-False]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[1-False]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_zero_no_optimizer[2-False]
```
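
For what it's worth, here is a quick sanity-check sketch for the "no kernel image is available" failures above (not part of this PR; it only inspects torch's own build, not the pre-compiled DeepSpeed extensions, which would need e.g. cuobjdump on the .so files):

```python
# Sanity-check sketch: compare the local GPU's compute capability with the arch
# list torch itself was built for. The same "does the binary contain sm_80
# kernels?" question applies to the separately compiled DeepSpeed ops.
import torch

major, minor = torch.cuda.get_device_capability(0)  # (8, 0) on an rtx-3090
wanted = f"sm_{major}{minor}"
built_for = torch.cuda.get_arch_list()  # available in recent torch versions
print(f"device needs   : {wanted}")
print(f"torch built for: {built_for}")
if wanted not in built_for:
    print("no matching kernels -> expect 'no kernel image is available' errors")
```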

jeffra (Collaborator) commented Dec 3, 2020

Ah yeah, I think this makes sense. PR #572 should fix this issue. Essentially, when pre-compiling our ops we weren't passing the compute capability flags for 8.0, which are needed to build the cuda/c++ code for the right hardware capabilities.

I think the unit tests should work if you instead re-install and use JIT-only compilation. JIT should pick up whatever compute capability is being used at runtime.
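
Roughly, the difference looks like the sketch below (illustrative only, not the code in #572; `gencode_flags()` is a hypothetical helper): an ahead-of-time build must be told every target architecture via nvcc `-gencode` flags, so 8.0 has to be listed explicitly for the rtx-3090, while a JIT build can query the GPU that is actually present:

```python
# Illustrative sketch of how -gencode flags could be derived; gencode_flags()
# is a hypothetical helper, not DeepSpeed's builder API.
import torch


def gencode_flags(cross_compile_archs=None):
    if cross_compile_archs is None:
        # JIT-style: ask the GPU that is actually present, e.g. (8, 0) on an rtx-3090.
        major, minor = torch.cuda.get_device_capability()
        archs = [f"{major}.{minor}"]
    else:
        # Ahead-of-time: the build machine may have a different (or no) GPU, so
        # the target archs must be listed explicitly. If 8.0 is missing here,
        # Ampere cards later fail with "no kernel image is available".
        archs = cross_compile_archs
    return [
        f"-gencode=arch=compute_{a.replace('.', '')},code=sm_{a.replace('.', '')}"
        for a in archs
    ]


print(gencode_flags(["6.0", "7.0"]))  # pre-compiled build without Ampere support
print(gencode_flags())                # JIT-style: picks up the local GPU
```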

jeffra (Collaborator) commented Dec 3, 2020

Merged #572 - it looks like this PR now has a merge conflict, though. Can you give it a try on your end after resolving the merge conflict with your change? It should be small, I think.

stas00 (Collaborator, Author) commented Dec 3, 2020

I built the binaries on your branch and tried one test - no change:

pytest -v --forked tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer           
====================================================================== test session starts ======================================================================
platform linux -- Python 3.8.5, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /home/stas/anaconda3/envs/main-38/bin/python
cachedir: .pytest_cache
rootdir: /mnt/nvme1/code/github/00optimize/deepspeed
plugins: hydra-core-1.0.3, forked-1.3.0, xdist-2.1.0, instafail-0.4.2, ipynb-1.1.1.dev0
collected 1 item                                                                                                                                                

tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer FAILED                                                                                [100%]

=========================================================================== FAILURES ============================================================================
_______________________________________________________________ test_checkpoint_unfused_optimizer _______________________________________________________________
Worker 0 hung.
--------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------
[2020-12-02 17:29:53,782] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.7+59ff206, git-hash=59ff206, git-branch=jeffra/cc80
[2020-12-02 17:29:53,783] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 17:29:53,783] [INFO] [engine.py:69:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-02 17:29:53,808] [INFO] [engine.py:592:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-12-02 17:29:53,808] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 17:29:53,808] [INFO] [engine.py:597:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 17:29:53,808] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 17:29:53,808] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
[2020-12-02 17:29:53,810] [INFO] [engine.py:627:_configure_optimizer] DeepSpeed Final Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-12-02 17:29:53,814] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 65536.0, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2.0, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'step': 1, 'exp_avg': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), 'exp_avg_sq': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')}, 1: {'step': 1, 'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')}}, 'param_groups': [{'lr': 0.00015, 'bias_correction': True, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0.0, 'max_grad_norm': 0.0, 'max_coeff': 10.0, 'min_coeff': 0.01, 'params': [0, 1]}]}, 'fp32_groups': [[tensor([[ 0.0839,  0.0599, -0.0598,  0.2319,  0.0323, -0.1559,  0.0842,  0.2578,
         -0.2104, -0.1906],
        [ 0.1914, -0.2820, -0.0882, -0.0095,  0.1659,  0.1089,  0.1512, -0.0975,
         -0.2190, -0.1095],
        [-0.2869,  0.1709,  0.2908,  0.0270, -0.0999,  0.1266,  0.1981,  0.1562,
         -0.2411, -0.2698],
        [-0.1201,  0.0515,  0.1439, -0.0302, -0.0568,  0.0694,  0.1666, -0.1321,
         -0.0742,  0.1001],
        [ 0.1855, -0.0106,  0.0769, -0.0176,  0.2812,  0.2426, -0.2013,  0.1372,
          0.2306, -0.0094],
        [ 0.1185,  0.1104, -0.2942, -0.0188, -0.2578,  0.0153,  0.3069, -0.0435,
          0.0638, -0.2389],
        [ 0.0477,  0.0047,  0.1968,  0.1646,  0.1123,  0.2030, -0.1915, -0.0587,
         -0.0530, -0.1991],
        [ 0.0147,  0.2159,  0.2598,  0.2520, -0.2839,  0.1892, -0.1602,  0.2883,
         -0.1327,  0.2002],
        [ 0.1338,  0.0575, -0.2703,  0.1372, -0.2605, -0.1187, -0.1914,  0.2290,
         -0.1929,  0.1995],
        [ 0.0037, -0.2291, -0.0471,  0.1107, -0.0403, -0.1243,  0.1771, -0.1327,
         -0.1637, -0.0156]], device='cuda:0', requires_grad=True), tensor([ 0.0542,  0.1185, -0.2170, -0.0490,  0.2025, -0.2917, -0.1114,  0.1796,
        -0.0192, -0.0256], device='cuda:0', requires_grad=True)]]}
[2020-12-02 17:29:53,814] [INFO] [engine.py:456:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = OneCycle
[2020-12-02 17:29:53,814] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.OneCycle object at 0x7fbecd579fa0>
[2020-12-02 17:29:53,814] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.85, 0.99)]
[2020-12-02 17:29:53,814] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fbecd515bb0>
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... None
[2020-12-02 17:29:53,814] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   gradient_clipping ............ 1.0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   optimizer_name ............... lamb
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 0.00015}
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   scheduler_name ............... OneCycle
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   scheduler_params ............. {'cycle_first_step_size': 1000, 'cycle_first_stair_count': 500, 'cycle_second_step_size': 1000, 'cycle_second_stair_count': 500, 'decay_step_size': 1000, 'cycle_min_lr': 0.0001, 'cycle_max_lr': 0.001, 'decay_lr_rate': 0.001, 'cycle_min_mom': 0.85, 'cycle_max_mom': 0.99, 'decay_mom_rate': 0.0}
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   steps_per_print .............. 1
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   train_batch_size ............. 2
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  1
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_config .................. <deepspeed.runtime.zero.config.DeepSpeedZeroConfig object at 0x7fbecd515c40>
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_enabled ................. False
[2020-12-02 17:29:53,815] [INFO] [config.py:648:print]   zero_optimization_stage ...... 0
[2020-12-02 17:29:53,815] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true
    },
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "lr":0.00015
        },
        "type":"Lamb"
    },
    "scheduler":{
        "params":{
            "cycle_first_stair_count":500,
            "cycle_first_step_size":1000,
            "cycle_max_lr":0.001,
            "cycle_max_mom":0.99,
            "cycle_min_lr":0.0001,
            "cycle_min_mom":0.85,
            "cycle_second_stair_count":500,
            "cycle_second_step_size":1000,
            "decay_lr_rate":0.001,
            "decay_mom_rate":0.0,
            "decay_step_size":1000
        },
        "type":"OneCycle"
    },
    "steps_per_print":1,
    "train_batch_size":2
}
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
THCudaCheck FAIL file=csrc/lamb/fused_lamb_cuda_kernel.cu line=465 error=209 : no kernel image is available for execution on the device
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 223, in _test_checkpoint_unfused_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 624, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 702, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 101, in __init__
    self.initialize_optimizer_states()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 368, in initialize_optimizer_states
    self.optimizer.step()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/lamb/fused_lamb.py", line 168, in step
    lamb_coeff = self.fused_lamb_cuda.lamb(p.data,
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at csrc/lamb/fused_lamb_cuda_kernel.cu:465
==================================================================== short test summary info ====================================================================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer

Will try JIT next.
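(A quick check, as a hedged sketch: compare the card's compute capability against the arch list the installed torch build reports. The deepspeed ops are compiled separately, so this is only a first sanity check; torch.cuda.get_arch_list() is assumed to be available in this nightly.)

```
# prints e.g. (8, 0) for the rtx-3090, and the compute archs the torch build itself supports
python -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"
```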

@stas00
Collaborator Author

stas00 commented Dec 3, 2020

JIT fails too. Also, weirdly, it skipped over the rtx-3090 card (0) and ran the test on the other, older card (1). (Same test as above.)

There is a very long output, ending with:

[2020-12-02 17:39:30,777] [INFO] [engine.py:701:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-02 17:39:30,777] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
Emitting ninja build file /home/stas/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/TH -isystem /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/THC -isystem /home/stas/anaconda3/envs/main-38/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 8.172157049179077 seconds
--------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------
THCudaCheck FAIL file=/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu line=465 error=209 : no kernel image is available for execution on the device
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/common.py", line 46, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 223, in _test_checkpoint_unfused_optimizer
    checkpoint_correctness_verification(args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 134, in checkpoint_correctness_verification
    ds_model = create_deepspeed_model(args=args,
  File "/mnt/nvme1/code/github/00optimize/deepspeed/tests/unit/test_checkpointing.py", line 111, in create_deepspeed_model
    ds_model, _, _, _ = deepspeed.initialize(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/__init__.py", line 109, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 181, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 624, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 702, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 101, in __init__
    self.initialize_optimizer_states()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 368, in initialize_optimizer_states
    self.optimizer.step()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/lamb/fused_lamb.py", line 168, in step
    lamb_coeff = self.fused_lamb_cuda.lamb(p.data,
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu:465
==================================================================== short test summary info ====================================================================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch']
torch version .................... 1.8.0.dev20201202+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.7+b88a741, b88a741, patch-1
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.0

@jeffra
Collaborator

jeffra commented Dec 3, 2020

  1. Regarding the JIT issue: what is CUDA_VISIBLE_DEVICES set to during runtime? Update: actually, this is interesting. Our unit tests run across multiple GPUs; we typically run them on nodes with up to 4 GPUs. I am not sure we currently handle it gracefully when the GPU count a test expects is not available (cc/ @ShadenSmith).

  2. Hopefully the new update fixes the issue with pre-installing the ops and running the unit tests?

@stas00
Collaborator Author

stas00 commented Dec 3, 2020

  • Regarding the JIT issue: what is CUDA_VISIBLE_DEVICES set to during runtime?

It wasn't set - both cards should have been visible.
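For completeness, making the device visibility explicit would look like this (the indices are an assumption about this box's ordering):

```
# expose only the rtx-3090, or both cards in a fixed order, to the run
CUDA_VISIBLE_DEVICES=0 ds_report
CUDA_VISIBLE_DEVICES=0,1 pytest -v --forked tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
```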

  • Hopefully the new update fixes the issue with pre-installing the ops and running the unit tests?

As reported in my two comments above yours - it didn't help. Please let me know if you need any other setup info to diagnose these.

@jeffra
Collaborator

jeffra commented Dec 3, 2020

I just created a new issue for us to fix our unit tests so they are runnable on fewer than 4 GPUs. Unfortunately, right now the best way to check whether everything is set up correctly is a combination of ds_report and running an example model on your hardware (a sketch follows below). Also, I don't believe we have done much testing on systems with heterogeneous GPU types, since we don't have access to any systems like that. Are you intending to only run with 1 GPU type at a time or did you want to use both?
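As a rough sketch (the script and config names below are placeholders, not files shipped in this repo), an end-to-end smoke test with the launcher would be:

```
# hypothetical example run on the 2-gpu box; train.py and ds_config.json are placeholders
deepspeed --num_gpus=2 train.py --deepspeed --deepspeed_config ds_config.json
```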

The original PR here seems fine, so I'll merge it now. If you run into any problems, though, don't hesitate to open an issue.

@jeffra jeffra merged commit ff58fa7 into microsoft:master Dec 3, 2020
@stas00 stas00 deleted the patch-1 branch December 3, 2020 06:02
@stas00
Collaborator Author

stas00 commented Dec 3, 2020

Are you intending to only run with 1 GPU type at a time or did you want to use both?

I have 2 GPUs in my current box and am building another box with 2 more, older GPUs, so I'm hoping to be able to test complex multi-node setups.

I'm just impatiently waiting for cuda-11.2 to be released so that pytorch and friends will fully support the rtx-3090. It has been a month of pain since I got the card...

@g-karthik

g-karthik commented Dec 24, 2020

@stas00 @jeffra I came across this PR because I'm having a somewhat related issue. I was recently trying to scale my training jobs up to 30+ nodes, and I found that the deepspeed.initialize() call fails with the following:

Traceback (most recent call last):
    model_engine, optimizer, _, _ = deepspeed.initialize(args, model, model_parameters=model.parameters())
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/__init__.py", line 118, in initialize
    config_params=config_params)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/engine.py", line 148, in __init__
    dist.init_process_group(backend=self.dist_backend)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/rendezvous.py", line 168, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Connection reset by peer

It didn't happen when I was training with fewer than 30 nodes. Any ideas on what's going on?

Also, this stack trace was printed just 3 times in my log file when training with 30 nodes (8 GPUs each), i.e. 240 processes, which suggests the reset happened on only a few processes, not all of them. Either way, the actual training never started because of this RuntimeError on those processes.

If this is unrelated to this PR (I only landed here because I searched for "Connection reset by peer"), I can create a new issue.

@stas00
Collaborator Author

stas00 commented Dec 24, 2020

The PR itself was about something entirely different; it's the tests that failed to run on my 2-gpu single-node setup, and that failure was unrelated to the PR.

The test failure does look similar to your issue: in my report, process 1 reported a failure, and in response process 2 reset the connection. Is it possible that some process failed in your setup but its error was never surfaced? I saw at least one situation where deepspeed just exit(0)-ed without any error message (when I had misconfigured the config file).
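If it helps, a per-process sanity check of the env:// rendezvous variables that the TCPStore in your trace is built from (assuming your launcher exports them) is as simple as:

```
# every rank should agree on MASTER_ADDR/MASTER_PORT/WORLD_SIZE and report a unique RANK
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT RANK=$RANK WORLD_SIZE=$WORLD_SIZE"
```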

But a separate issue is probably the best way to proceed; you can link to the failure I reported here as related.

Successfully merging this pull request may close these issues.

Cuda 11 cannot be supported