I found an issue.
-
I got an error when only a subgraph is forwarded in DDP training mode. The dataloader provides batches with annotations for only some of the heads at each step, which leads to gradients being computed only for that subgraph of the model.
Env settings:
Error:
ModelWrapper used:
A possible solution could be (link): passing allow_grad to autograd or distributed.autograd. But I need a little guidance to find the right code module.
And one more important thing - the training code works fine on a single GPU (without DDP).
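Not a confirmed fix for this exact setup, but two existing PyTorch knobs match this symptom and may be what allow_grad above refers to: torch.autograd.grad accepts allow_unused=True, and DistributedDataParallel accepts find_unused_parameters=True, which lets DDP skip reduction for parameters that took no part in producing the loss on a given step. Below is a minimal sketch under those assumptions; MultiHeadModel, the head names, and the tensor shapes are invented stand-ins for the real ModelWrapper.

```python
# Hedged sketch: a multi-head model where only annotated heads run each step,
# wrapped in DDP with find_unused_parameters=True. MultiHeadModel, head names,
# and tensor shapes are hypothetical stand-ins for the thread's ModelWrapper.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


class MultiHeadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.head_a = nn.Linear(8, 2)
        self.head_b = nn.Linear(8, 2)

    def forward(self, x, active_heads):
        feats = self.backbone(x)
        # Only heads with annotations in this batch are forwarded, so the
        # remaining heads' parameters get no gradient (the "subgraph" case).
        return {h: getattr(self, f"head_{h}")(feats) for h in active_heads}


def main():
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = MultiHeadModel().to(device)
    # Without find_unused_parameters=True, DDP typically raises the
    # "parameters that were not used in producing loss" runtime error
    # whenever a head is skipped for a step.
    ddp_model = DDP(
        model,
        device_ids=[local_rank] if torch.cuda.is_available() else None,
        find_unused_parameters=True,
    )
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    # Simulated step: this batch carries annotations only for head "a".
    x = torch.randn(4, 8, device=device)
    target = torch.randint(0, 2, (4,), device=device)
    outputs = ddp_model(x, active_heads=("a",))
    loss = F.cross_entropy(outputs["a"], target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with e.g. torchrun --nproc_per_node=2 train_sketch.py. Note that find_unused_parameters=True adds per-iteration overhead, since DDP walks the autograd graph each step to locate the unused parameters.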