
multi GPU training issue #1798

Open
sanjuktasr opened this issue Nov 6, 2024 · 10 comments

Comments

@sanjuktasr

2024-11-05 12:55:26,724 INFO [train.py:1231] (0/2) Training will start from epoch : 1
2024-11-05 12:55:26,725 INFO [train.py:1243] (0/2) Training started
2024-11-05 12:55:26,726 INFO [train.py:1253] (0/2) Device: cuda:0
2024-11-05 12:55:26,728 INFO [train.py:1265] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.24.0.dev+git.866e4a80.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '144163c-clean', 'icefall-git-date': 'Fri Oct 18 14:09:24 2024', 'icefall-path': '/builds/mihup/asr/zipformer/icefall', 'k2-path': '/usr/local/lib/python3.9/dist-packages/k2/init.py', 'lhotse-path': '/workspace/lhotse/lhotse/init.py', 'hostname': 'runner-t2iavcpo-project-47789012-concurrent-0', 'IP address': '172.17.0.3'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-Hindi/2024-11-05T10:55:25Z'), 'bpe_model': 'data/2024-11-05T10:55:25Z/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 
512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'aws_access_key_id': None, 'aws_secret_access_key': None, 'finetune': None, 'av': 9, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/2024-11-05T10:55:25Z/fbank', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-05 12:55:26,728 INFO [train.py:1267] (0/2) About to create model
2024-11-05 12:55:26,733 INFO [train.py:1231] (1/2) Training will start from epoch : 1
2024-11-05 12:55:26,734 INFO [train.py:1243] (1/2) Training started
2024-11-05 12:55:26,734 INFO [train.py:1253] (1/2) Device: cuda:1
2024-11-05 12:55:26,736 INFO [train.py:1265] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.24.0.dev+git.866e4a80.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '144163c-clean', 'icefall-git-date': 'Fri Oct 18 14:09:24 2024', 'icefall-path': '/builds/mihup/asr/zipformer/icefall', 'k2-path': '/usr/local/lib/python3.9/dist-packages/k2/init.py', 'lhotse-path': '/workspace/lhotse/lhotse/init.py', 'hostname': 'runner-t2iavcpo-project-47789012-concurrent-0', 'IP address': '172.17.0.3'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-Hindi/2024-11-05T10:55:25Z'), 'bpe_model': 'data/2024-11-05T10:55:25Z/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 
512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'aws_access_key_id': None, 'aws_secret_access_key': None, 'finetune': None, 'av': 9, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/2024-11-05T10:55:25Z/fbank', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-05 12:55:26,736 INFO [train.py:1267] (1/2) About to create model
2024-11-05 12:55:26,998 INFO [train.py:1271] (0/2) Number of model parameters: 23627887
2024-11-05 12:55:27,047 INFO [train.py:1271] (1/2) Number of model parameters: 23627887
2024-11-05 12:55:27,934 INFO [train.py:1286] (0/2) Using DDP
2024-11-05 12:55:27,986 INFO [train.py:1286] (1/2) Using DDP
Traceback (most recent call last):
  File "/builds/mihup/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1530, in <module>
    main()
  File "/builds/mihup/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1521, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGTERM
WARNING: script canceled externally (UI, API)
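For context, icefall's train.py launches one process per GPU with torch.multiprocessing.spawn and wraps the model in DistributedDataParallel, so a SIGTERM in any child tears down the whole group; together with the "script canceled externally" line, the traceback is consistent with the CI runner killing the job rather than the training code itself crashing. A minimal sketch of the per-process rendezvous-plus-DDP-wrap pattern (not icefall's actual code; it uses the gloo backend and world_size=1 so it runs on CPU, and the port and model are illustrative):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(rank: int = 0, world_size: int = 1) -> torch.nn.Module:
    """Rendezvous and wrap a model in DDP, as each spawned worker does.

    Uses "gloo" so this sketch runs on CPU; real multi-GPU training
    would use backend="nccl" and pass device_ids=[rank] to DDP.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "12355")  # illustrative port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = torch.nn.Linear(8, 8)  # stand-in for the real model
    return DDP(model)
```

In the real recipe this function body runs inside the `run` callable passed to `mp.spawn`, once per rank; if any rank dies, `context.join()` raises `ProcessExitedException` as seen in the traceback above.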

@csukuangfj
Collaborator

How large is your GPU RAM?

@sanjuktasr
Author

24 GB each for 2 GPUs.
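If it helps to confirm this programmatically, a small sketch that reports the total memory of each visible CUDA device (function name is illustrative; returns an empty list on a CPU-only machine):

```python
import torch


def gpu_memory_gb() -> list:
    """Total memory, in GiB, of each visible CUDA device."""
    return [
        torch.cuda.get_device_properties(i).total_memory / 1024**3
        for i in range(torch.cuda.device_count())
    ]
```

`nvidia-smi` gives the same numbers, plus current utilization, from the shell.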

@csukuangfj
Collaborator

Can you reproduce it?

@sanjuktasr
Author

sanjuktasr commented Nov 7, 2024

2024-11-07 13:12:14,793 INFO [train.py:1120] (1/2) Device: cuda:1
2024-11-07 13:12:14,793 INFO [train.py:1120] (0/2) Device: cuda:0
2024-11-07 13:12:14,797 INFO [train.py:1132] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7ff6d891905ff482364f2d0015867b00d89dd8c7', 'k2-git-date': 'Fri Jun 16 12:10:37 2023', 'lhotse-version': '1.16.0.dev+git.aa073f6a.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'fc2df07-dirty', 'icefall-git-date': 'Wed Aug 16 20:02:41 2023', 'icefall-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/icefall', 'k2-path': '/home/sanjukta/anaconda3/envs/zipf1/lib/python3.9/site-packages/k2/init.py', 'lhotse-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/lhotse/lhotse/init.py', 'hostname': 'asus-System-Product-Name', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-mmi_online/30_05_2024'), 'bpe_model': 'data/8k/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': 
'31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/8k/fbank/', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-07 13:12:14,797 INFO [train.py:1134] (0/2) About to create model
2024-11-07 13:12:14,797 INFO [train.py:1132] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7ff6d891905ff482364f2d0015867b00d89dd8c7', 'k2-git-date': 'Fri Jun 16 12:10:37 2023', 'lhotse-version': '1.16.0.dev+git.aa073f6a.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'fc2df07-dirty', 'icefall-git-date': 'Wed Aug 16 20:02:41 2023', 'icefall-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/icefall', 'k2-path': '/home/sanjukta/anaconda3/envs/zipf1/lib/python3.9/site-packages/k2/init.py', 'lhotse-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/lhotse/lhotse/init.py', 'hostname': 'asus-System-Product-Name', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-mmi_online/30_05_2024'), 'bpe_model': 'data/8k/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': 
'31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/8k/fbank/', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-07 13:12:14,798 INFO [train.py:1134] (1/2) About to create model
2024-11-07 13:12:15,083 INFO [train.py:1138] (0/2) Number of model parameters: 23285615
2024-11-07 13:12:15,085 INFO [train.py:1138] (1/2) Number of model parameters: 23285615
2024-11-07 13:12:15,984 INFO [train.py:1153] (1/2) Using DDP
2024-11-07 13:12:16,069 INFO [train.py:1153] (0/2) Using DDP
I executed the same code; it seems to hang at "Using DDP", with no progress after that. I waited for 5 minutes to check, but there was no progress. The code runs on each GPU individually.
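When DDP stalls right after the "Using DDP" log line, a common first step is to make the rendezvous and first collective fail loudly instead of blocking silently. A minimal sketch of that idea (env-var values, the port, and the 5-minute timeout are illustrative; gloo keeps it CPU-runnable, whereas the real run would use nccl):

```python
import datetime
import os

import torch.distributed as dist


def init_with_timeout(rank: int = 0, world_size: int = 1) -> None:
    """Rendezvous that raises on a stall instead of hanging forever.

    NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL make PyTorch and
    NCCL log enough detail to see which rank or collective is stuck.
    """
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "12356")  # illustrative port
    # "gloo" keeps this sketch CPU-runnable; real training uses "nccl".
    dist.init_process_group(
        "gloo",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=5),
    )
    dist.barrier()  # a hang here would now surface as a timeout error
```

Setting the two debug env vars before launching the existing train.py (without any code change) is often enough to reveal whether the hang is an NCCL rendezvous problem, e.g. a peer-to-peer or network-interface issue between the two GPUs.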

@csukuangfj
Collaborator

Are you able to reproduce it with librispeech?

@sanjuktasr
Author

Yes, this is with librispeech only.

@csukuangfj
Collaborator

Then why is the data manifest dir data/8k/fbank in your log?

Could you tell us what changes you have made?

@sanjuktasr
Author

sanjuktasr commented Nov 7, 2024

I am using different data, but the codebase is the same as librispeech. No changes were made, especially not in training.

@csukuangfj
Collaborator

What is the duration distribution of your data?

Are you able to reproduce it with the librispeech dataset?

@sanjuktasr
Author

It is a small experimental dataset for testing the codebase under librispeech. The training runs on a single GPU.
