SFT not working on nemo:24.05.01 container #236

Open
vecorro opened this issue Jul 13, 2024 · 0 comments
Labels
bug Something isn't working

Comments

vecorro commented Jul 13, 2024

Describe the bug

I'm trying to follow the SFT tutorial with a Llama-3-8b model, and the process fails with:

torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Failed to CUDA calloc async 608 bytes
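
As the error message itself suggests, the run can be repeated with NCCL debug logging enabled to capture more detail. A minimal sketch of that step (NCCL_DEBUG is a standard NCCL environment variable; the SFT command itself is unchanged from the reproduction steps below):

# Turn on NCCL debug logging in the shell, then re-run the exact SFT
# command from the reproduction steps below
export NCCL_DEBUG=INFO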

Steps/Code to reproduce bug

Within the NeMo container, I followed the tutorial to convert the Llama-3-8b weights from Hugging Face to the .nemo format (the conversion step I used is sketched right after the SFT command below), then I ran the following commands:

cd /opt/NeMo-Aligner/

python examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=bf16 \
   trainer.num_nodes=1 \
   trainer.devices=1 \
   trainer.sft.max_steps=-1 \
   trainer.sft.limit_val_batches=40 \
   trainer.sft.val_check_interval=1000 \
   model.megatron_amp_O2=True \
   model.restore_from_path=/workspace/nemo/models/llama3-8b/mcore_gpt.nemo \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.data.num_workers=0 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=128 \
   model.data.train_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.validation_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl \
   exp_manager.create_wandb_logger=True \
   exp_manager.explicit_log_dir=/results \
   exp_manager.wandb_logger_kwargs.project=sft_run \
   exp_manager.wandb_logger_kwargs.name=dolly_sft_run \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss
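
For reference, this is roughly the conversion step I ran before the SFT command (the paths are mine, and the script path and flags follow the tutorial as of NeMo 24.05 and may differ in other versions):

# HF -> .nemo conversion, as described in the tutorial (sketch from memory)
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path /workspace/huggingface/Meta-Llama-3-8B \
    --output_path /workspace/nemo/models/llama3-8b/mcore_gpt.nemo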


Expected behavior

I expected the SFT run to succeed.

Environment overview

  • Environment location: VMware vSphere 8

  • Method of NeMo-Aligner install: I used container nemo:24.05.01

  • Docker commands used (pull & run):

docker pull nvcr.io/nvidia/nemo:24.05.01

docker run --runtime nvidia --gpus all \
    -v ~/HF:/workspace/huggingface \
    -v ~/nemo:/workspace/nemo \
    --name my_nemo -td nvcr.io/nvidia/nemo:24.05.01
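
For completeness, the NGC PyTorch-based containers usually recommend additional shared-memory settings for multi-GPU workloads. I did not use them in the failing run, but a variant of the same command with those flags would look like this (possibly unrelated to the error, listed only so it can be ruled out):

# Same run command with the shared-memory flags the NGC containers recommend;
# NOT used in the failing run above
docker run --runtime nvidia --gpus all \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ~/HF:/workspace/huggingface \
    -v ~/nemo:/workspace/nemo \
    --name my_nemo -td nvcr.io/nvidia/nemo:24.05.01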

Environment details
nvidia-smi output from within the container

nvidia-smi
Sat Jul 13 17:38:14 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100XM-80C              On  |   00000000:03:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100XM-80C              On  |   00000000:03:01.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Additional context

I'm using 2 x H100 GPUs (vGPU profile NVIDIA H100XM-80C, as shown above). They work normally: I have already used them on the same VM to serve Llama-3-70b with vLLM without any issues. The vLLM container was stopped before I ran SFT in the NeMo-Aligner container.
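
To help separate NeMo-Aligner from the NCCL/CUDA layer on this vGPU setup, here is a small standalone collective test that can be run inside the same container. This is only a sketch I put together (the file name and contents are mine, not part of the tutorial); it mirrors the all_gather_object call that fails in the stack trace below:

# Write a tiny torch.distributed check (hypothetical helper, not from the tutorial)
cat > /tmp/nccl_check.py << 'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # torchrun supplies rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Same collective family that fails in validate_sharding_integrity()
objs = [None] * dist.get_world_size()
dist.all_gather_object(objs, f"rank {dist.get_rank()} ok")
print(objs)
dist.destroy_process_group()
EOF

# One process mirrors the failing trainer.devices=1 run; use --nproc_per_node=2 to exercise both GPUs
NCCL_DEBUG=INFO torchrun --nproc_per_node=1 /tmp/nccl_check.py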

Error stack trace:

[NeMo W 2024-07-13 17:26:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
[NeMo I 2024-07-13 17:26:30 train_gpt_sft:118] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-07-13 17:26:30 train_gpt_sft:119] 
    name: megatron_gpt_sft
    trainer:
      num_nodes: 1
      devices: 1
      accelerator: gpu
      precision: bf16
      sft:
        max_epochs: 1
        max_steps: -1
        val_check_interval: 1000
        save_interval: ${.val_check_interval}
        limit_train_batches: 1.0
        limit_val_batches: 40
        gradient_clip_val: 1.0
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_time: null
      max_epochs: ${.sft.max_epochs}
      max_steps: ${.sft.max_steps}
    exp_manager:
      explicit_log_dir: /results
      exp_dir: null
      name: ${name}
      create_wandb_logger: true
      wandb_logger_kwargs:
        project: sft_run
        name: dolly_sft_run
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_loss
        save_top_k: 5
        mode: min
        save_nemo_on_train_end: true
        filename: megatron_gpt_sft--{${.monitor}:.3f}-{step}-{consumed_samples}-{epoch}
        model_parallel_size: ${model.tensor_model_parallel_size}
        save_best_model: false
    model:
      seed: 1234
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      restore_from_path: /workspace/nemo/models/llama3-8b/mcore_gpt.nemo
      resume_from_checkpoint: null
      save_nemo_on_validation_end: true
      sync_batch_comm: false
      megatron_amp_O2: true
      encoder_seq_length: 4096
      sequence_parallel: false
      activations_checkpoint_granularity: null
      activations_checkpoint_method: null
      activations_checkpoint_num_layers: null
      activations_checkpoint_layers_per_pipeline: null
      answer_only_loss: true
      gradient_as_bucket_view: false
      seq_len_interpolation_factor: null
      use_flash_attention: null
      hidden_dropout: 0.0
      attention_dropout: 0.0
      ffn_dropout: 0.0
      steerlm2:
        forward_micro_batch_size: 1
        micro_batch_size: 1
      peft:
        peft_scheme: none
        restore_from_path: null
        lora_tuning:
          target_modules:
          - attention_qkv
          adapter_dim: 32
          adapter_dropout: 0.0
          column_init_method: xavier
          row_init_method: zero
          layer_selection: null
          weight_tying: false
          position_embedding_strategy: null
      data:
        chat: false
        chat_prompt_tokens:
          system_turn_start: "\0"
          turn_start: "\x11"
          label_start: "\x12"
          end_of_turn: '
    
            '
          end_of_name: '
    
            '
        sample: false
        num_workers: 0
        dataloader_type: single
        train_ds:
          file_path: /workspace/nemo/datasets/databricks-dolly-15k-output.jsonl
          global_batch_size: 128
          micro_batch_size: 1
          shuffle: true
          memmap_workers: null
          max_seq_length: ${model.encoder_seq_length}
          min_seq_length: 1
          drop_last: true
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: false
          truncation_field: input
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          hf_dataset: false
          truncation_method: right
        validation_ds:
          file_path: /workspace/nemo/datasets/databricks-dolly-15k-output.jsonl
          global_batch_size: 128
          micro_batch_size: 1
          shuffle: false
          memmap_workers: ${model.data.train_ds.memmap_workers}
          max_seq_length: ${model.data.train_ds.max_seq_length}
          min_seq_length: 1
          drop_last: true
          label_key: ${model.data.train_ds.label_key}
          add_eos: ${model.data.train_ds.add_eos}
          add_sep: ${model.data.train_ds.add_sep}
          add_bos: ${model.data.train_ds.add_bos}
          truncation_field: ${model.data.train_ds.truncation_field}
          index_mapping_dir: null
          prompt_template: ${model.data.train_ds.prompt_template}
          hf_dataset: false
          truncation_method: right
          output_original_text: true
      optim:
        name: distributed_fused_adam
        lr: 5.0e-06
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          warmup_steps: 10
          constant_steps: 1000
          min_lr: 9.0e-07
    
[NeMo W 2024-07-13 17:26:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-07-13 17:26:31 exp_manager:708] Exp_manager is logging to /results, but it already exists.
[NeMo W 2024-07-13 17:26:31 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/results/checkpoints. Training from scratch.
[NeMo I 2024-07-13 17:26:31 exp_manager:396] Experiments will be logged at /results
[NeMo I 2024-07-13 17:26:31 exp_manager:856] TensorboardLogger has been set up
[NeMo I 2024-07-13 17:26:31 exp_manager:871] WandBLogger has been set up
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[NeMo I 2024-07-13 17:26:42 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:310] All tensor model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:311] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-07-13 17:26:42 megatron_init:331] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:343] Rank 0 has embedding group: [0]
[NeMo I 2024-07-13 17:26:42 megatron_init:349] All pipeline model parallel group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:350] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-07-13 17:26:42 megatron_init:351] All embedding group ranks: [[0]]
[NeMo I 2024-07-13 17:26:42 megatron_init:352] Rank 0 has embedding rank: 0
24-07-13 17:26:42 - PID:928 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 128
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-07-13 17:26:42 tokenizer_utils:178] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B
[NeMo W 2024-07-13 17:26:42 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
      warnings.warn(
    
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[NeMo I 2024-07-13 17:26:42 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_overlap_rs_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:1158] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:498] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: activation_func_fp8_input_store in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: qk_layernorm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: test_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: calculate_per_token_loss in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_dot_product_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: fp8_multi_head_attention in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_router_load_balancing_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_router_topk in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_grouped_gemm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_aux_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_z_loss_coeff in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_input_jitter_eps in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_dropping in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_dispatcher_type in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_per_layer_logging in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_expert_capacity_factor in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_pad_expert_input_to_capacity in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_token_drop_policy in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: moe_layer_recompute in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: disable_parameter_transpose_cache in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: enable_cuda_graph in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-07-13 17:26:42 megatron_base_model:556] The model: GPTSFTModel() does not have field.name: rotary_percent in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[NeMo I 2024-07-13 17:26:56 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.sft.max_steps=-1', 'trainer.sft.limit_val_batches=40', 'trainer.sft.val_check_interval=1000', 'model.megatron_amp_O2=True', 'model.restore_from_path=/workspace/nemo/models/llama3-8b/mcore_gpt.nemo', 'model.optim.lr=5e-6', 'model.answer_only_loss=True', 'model.data.num_workers=0', 'model.data.train_ds.micro_batch_size=1', 'model.data.train_ds.global_batch_size=128', 'model.data.train_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl', 'model.data.validation_ds.micro_batch_size=1', 'model.data.validation_ds.global_batch_size=128', 'model.data.validation_ds.file_path=/workspace/nemo/datasets/databricks-dolly-15k-output.jsonl', 'exp_manager.create_wandb_logger=True', 'exp_manager.explicit_log_dir=/results', 'exp_manager.wandb_logger_kwargs.project=sft_run', 'exp_manager.wandb_logger_kwargs.name=dolly_sft_run', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.monitor=validation_loss']
Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 243, in <module>
    main()
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 129, in main
    ptl_model, updated_cfg = load_from_nemo(
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 98, in load_from_nemo
    model = cls.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/opt/NeMo/nemo/core/classes/modelPT.py", line 464, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 51, in restore_from
    return super().restore_from(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1172, in restore_from
    checkpoint = checkpoint_io.load_checkpoint(tmp_model_weights_dir, sharded_state_dict=checkpoint)
  File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 78, in load_checkpoint
    return dist_checkpointing.load(
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 133, in load
    validate_sharding_integrity(nested_values(sharded_state_dict))
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 425, in validate_sharding_integrity
    torch.distributed.all_gather_object(all_sharding, sharding)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2310, in all_gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2724, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1961, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc async 608 bytes

vecorro added the bug label on Jul 13, 2024