TorchElastic standalone test silently fails #15418

Closed
awaelchli opened this issue Oct 31, 2022 · 6 comments · Fixed by #15594
Labels: bug (Something isn't working), environment: torchelastic, priority: 1 (Medium priority task), tests

Comments

@awaelchli
Contributor

awaelchli commented Oct 31, 2022

Bug

The test case

https://github.com/Lightning-AI/lightning/blob/a008801e25a6e93214b410b68ca1edf2e3ddae14/tests/tests_pytorch/run_standalone_tasks.sh#L27-L32

silently fails in our CI.
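
The linked lines are not reproduced here; the snippet below is only a rough bash sketch of the launch-and-grep pattern such a standalone task typically follows. The command, flags, and the report variable are illustrative, not the repository's actual script:

# Illustrative sketch only -- not the actual contents of run_standalone_tasks.sh.
# The task is launched through torchelastic and its output is captured:
report=$(python -m torch.distributed.run --nproc_per_node=2 \
  plugins/environments/torch_elastic_deadlock.py 2>&1)
# Only the captured text is inspected for a success marker; if neither the grep
# result nor the launcher's exit code is acted on, a failing rank cannot fail
# the CI job.
echo "$report" | grep "SUCCEEDED"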

Output

Running plugins/environments/torch_elastic_deadlock.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66    
---------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
/__w/2/s/src/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/__w/2/s/src/pytorch_lightning/trainer/trainer.py:1555: PossibleUserWarning: The number of training batches (5) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
/__w/2/s/src/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
[W reducer.cpp:1251] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1251] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 54410) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Observed on the latest commit on master.

Expected Behavior

The test should pass. Additionally, if the test fails as it does now, the failure should be reported in the CI. We currently grep the test output for SUCCEEDED, but a missing match does not make the CI job fail.
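
A minimal sketch of how the missing marker could be turned into a CI failure, assuming the task's combined stdout/stderr has been captured into a variable (the report name is illustrative):

# Sketch: make the CI step fail when the success marker is missing.
if ! echo "$report" | grep -q "SUCCEEDED"; then
  echo "torch_elastic_deadlock.py did not report SUCCEEDED" >&2
  exit 1
fi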

cc @tchaton @rohitgr7 @Borda @awaelchli @carmocca

awaelchli added the needs triage (Waiting to be triaged by maintainers), bug (Something isn't working), tests, and environment: torchelastic labels and removed the needs triage label on Oct 31, 2022
awaelchli added this to the v1.8.x milestone on Oct 31, 2022
@Borda
Member

Borda commented Nov 7, 2022

Borda added the priority: 1 (Medium priority task) label on Nov 7, 2022
@carmocca
Contributor

carmocca commented Nov 7, 2022

@Borda The script already sets it. It's an issue with the test not propagating the failure correctly, since only one rank fails. https://github.com/Lightning-AI/lightning/blob/04e1e925daaf18acd3ec1b816575308f76f5f813/tests/tests_pytorch/run_standalone_tasks.sh#L15
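
Since torch.distributed.run exits non-zero when any child rank fails (the ChildFailedError in the log above), one option would be to check the launcher's exit status directly in the shell script. A minimal sketch, assuming bash and an illustrative elastic.log file name:

# Sketch: surface a single failing rank via the launcher's exit code.
python -m torch.distributed.run --nproc_per_node=2 \
  plugins/environments/torch_elastic_deadlock.py 2>&1 | tee elastic.log
status=${PIPESTATUS[0]}   # exit code of the launcher, not of tee (bash-specific)
if [ "$status" -ne 0 ]; then
  echo "torchelastic launcher exited with code $status" >&2
  exit "$status"
fi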

@Borda
Member

Borda commented Nov 7, 2022

Yes, I saw it in the script, but I'm not sure what we can do then, as this is quite critical for visibility...

@carmocca
Contributor

carmocca commented Nov 8, 2022

There's nothing we can do other than fix the test and grep for "error" in its text output, as I did to resolve #12474.
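
A rough sketch of that approach, reusing the illustrative elastic.log capture from the earlier sketch (the exact pattern used to resolve #12474 may differ):

# Sketch: fail when known error markers appear in the captured output.
if grep -iqE "error|exception|traceback" elastic.log; then
  echo "Errors found in the torchelastic test output:" >&2
  grep -iE "error|exception|traceback" elastic.log >&2
  exit 1
fi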

@akihironitta
Contributor

Closing this since I don't see anything to address via this issue. The improvement in #15341 by Carlos is reasonable, IMO.

@carmocca
Contributor

carmocca commented Nov 8, 2022

Note that even though we don't have a reliable solution to surface these distributed issues, we could still fix this test in particular so that it passes again.
