
[RFC] Add Accelerator.is_available() interface requirement #11818

Closed
ananthsub opened this issue Feb 9, 2022 · 10 comments · Fixed by #11797
Labels: accelerator · breaking change · design · feature
Milestone: 1.6

ananthsub commented Feb 9, 2022

🚀 Feature

Add an is_available() static method requirement to the Accelerator interface so that each accelerator implementation can report whether its corresponding hardware is present.

Motivation

Such functionality on the Accelerator abstraction would:

  • Enable automatic hardware selection without duplicating code across Trainer & individual accelerator implementations.
  • Simplify the accelerator connector logic and rewrite effort: Rewrite accelerator_connector #11448
  • Enable automatic runtime checking of hardware availability during execution
  • Provide consistency with how the Trainer auto-detects the cluster environments natively supported by the framework. The corollary here is ClusterEnvironment.detect, sketched below the permalinks.

https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/plugins/environments/cluster_environment.py#L43-L46

https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/trainer/connectors/accelerator_connector.py#L810-L812
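
For reference, the linked ClusterEnvironment.detect hook and its use in the connector look roughly like the following. This is a paraphrased sketch: _detect_cluster_environment is an illustrative helper, and the exact set of environments the connector checks may differ from what is shown here.

from pytorch_lightning.plugins.environments import (
    ClusterEnvironment,
    KubeflowEnvironment,
    LightningEnvironment,
    LSFEnvironment,
    TorchElasticEnvironment,
)


# The abstract hook on ClusterEnvironment (paraphrased from the first permalink):
#
#     @staticmethod
#     @abstractmethod
#     def detect() -> bool:
#         """Detects the environment settings corresponding to this cluster and returns True if they match."""


def _detect_cluster_environment() -> ClusterEnvironment:
    """Illustrative helper mirroring the auto-selection in the second permalink."""
    for env_type in (TorchElasticEnvironment, KubeflowEnvironment, LSFEnvironment):
        if env_type.detect():
            return env_type()
    # No managed cluster detected: fall back to the default single-node environment.
    return LightningEnvironment()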

Pitch

from abc import ABC, abstractmethod

import torch


class Accelerator(ABC):

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Detect if the hardware is available."""

    def setup_environment(self, root_device: torch.device) -> None:
        """Setup any processes or distributed connections.

        This is called before the LightningModule/DataModule setup hook, which allows the user to access the
        accelerator environment before setup is complete.

        Raises:
            RuntimeError:
                If the corresponding hardware is not found.
        """
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")


class CPUAccelerator(Accelerator):

    @staticmethod
    def is_available() -> bool:
        """CPU is always available for execution."""
        return True


class GPUAccelerator(Accelerator):

    @staticmethod
    def is_available() -> bool:
        return torch.cuda.is_available() and torch.cuda.device_count() > 0

and so on for the other accelerators (TPU, IPU, etc.).

See #11797 for a more detailed implementation of what this looks like in practice.

To support Trainer(accelerator="auto"), this is what the logic simplifies to:

for acc_cls in (GPUAccelerator, TPUAccelerator, IPUAccelerator, CPUAccelerator):
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator() # fallback to CPU

This could be simplified even further if we offered an AcceleratorRegistry, so that the Trainer/AcceleratorConnector didn't need to hardcode the list of accelerators to detect (a minimal sketch follows the snippet below):

for acc_cls in AcceleratorRegistry.impls:
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator() # fallback to CPU
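
A minimal sketch of what such a registry could look like, building on the Accelerator base class from the pitch above. AcceleratorRegistry and its register decorator are hypothetical here, not an existing Lightning API:

from typing import List, Type


class AcceleratorRegistry:
    """Hypothetical registry that accelerator implementations opt into at definition time."""

    # Registration order doubles as detection priority, with CPU registered last as the fallback.
    impls: List[Type[Accelerator]] = []

    @classmethod
    def register(cls, acc_cls: Type[Accelerator]) -> Type[Accelerator]:
        """Class decorator that adds an Accelerator subclass to the registry."""
        cls.impls.append(acc_cls)
        return acc_cls


# Each implementation then registers itself where it is defined, e.g.:
#
#     @AcceleratorRegistry.register
#     class GPUAccelerator(Accelerator):
#         ...

With this in place, the auto-selection loop above only needs to iterate AcceleratorRegistry.impls.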

Alternatives

Some other alternatives exist here:
#11799
#11798

Issues with these approaches:

  • They are also breaking changes: simply instantiating the accelerator could raise a runtime error if the device isn't available.
  • The bigger issue to me is that they do not ease support for Trainer(accelerator="auto"): the accelerator connector would still need to hardcode & re-implement each of the device checks to determine which Accelerator to even instantiate.

Additional context



cc @Borda @tchaton @justusschock @awaelchli @akihironitta @rohitgr7

ananthsub added the feature, design, breaking change, and accelerator labels on Feb 9, 2022
ananthsub added this to the 1.6 milestone on Feb 9, 2022
ananthsub linked a pull request (#11797) on Feb 9, 2022 that will close this issue

tchaton commented Feb 9, 2022

Looks great, I like this.


rohitgr7 commented Feb 9, 2022

Looks good!

One quick question: why is this check inside setup_environment here?

    def setup_environment(self, root_device: torch.device) -> None:
        """Setup any processes or distributed connections.
        This is called before the LightningModule/DataModule setup hook which allows the user to access the accelerator
        environment before setup is complete.
        Raises:
            RuntimeError:
                If corresponding hardware is not found.
        """
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")

and not inside init? I think we had a discussion regarding this in one of your PRs.


ananthsub commented Feb 9, 2022

and not inside init? I think we had a discussion regarding this in one of your PRs.

@rohitgr7 Great question. Ultimately, the accelerator is mimicking torch device semantics.

PyTorch users today can create a torch.device("cuda") on a host without GPUs available without any error being raised.
An error only appears when someone tries to do CUDA operations, like moving the tensor to the device.

d = torch.device("cuda")  # this is fine
torch.tensor(0.0).to(d)  # this raises an error

To me, this is very similar to the Accelerator init (device creation) and Accelerator.setup_environment (moving data to the device).

To preserve this style, I don't raise an exception in init, and instead do it at setup_environment.
It also makes instantiating, mocking, and testing the different devices a lot easier to do.
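
To illustrate that last point: with the check living in setup_environment rather than __init__, an accelerator can be instantiated on any machine and the availability probe can be mocked in tests, roughly like this. This is a sketch against the GPUAccelerator from the pitch above; the test name is illustrative.

from unittest import mock

import pytest
import torch


def test_gpu_accelerator_on_cpu_only_machine():
    acc = GPUAccelerator()  # instantiation alone never raises, mirroring torch.device("cuda")
    with mock.patch("torch.cuda.is_available", return_value=False):
        with pytest.raises(RuntimeError, match="not configured to run on this hardware"):
            acc.setup_environment(torch.device("cuda", 0))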


justusschock commented Feb 9, 2022

@ananthsub I see the reasons for that, but on the other hand I usually am in favor of raising errors as early as possible. And for testing we will probably have to mock the torch.cuda package anyways.

Not sure which one has more pros or cons, just wanted to mention this :) But I definitely like the idea in general :)

@ananthsub

@justusschock I agree, failing fast is very important.

To note:

I'd argue that Strategy.setup_environment should happen before prepare_data, as other existing misconfiguration checks currently take place there and could happen sooner: https://github.com/PyTorchLightning/pytorch-lightning/blob/b34d8673a9e77d25a826b7ada31d1204dfc818ab/pytorch_lightning/accelerators/cpu.py#L28-L36
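
For context, the misconfiguration check behind that permalink is roughly the following (paraphrased; the exact code and message may differ):

import torch

from pytorch_lightning.utilities.exceptions import MisconfigurationException


class CPUAccelerator(Accelerator):  # paraphrased from the linked file

    def setup_environment(self, root_device: torch.device) -> None:
        """
        Raises:
            MisconfigurationException:
                If the selected device is not CPU.
        """
        super().setup_environment(root_device)
        if root_device.type != "cpu":
            raise MisconfigurationException(f"Device should be CPU, got {root_device} instead.")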

@awaelchli

Makes sense

And separately, we could move setup_environment to happen even sooner inside of Trainer._run

Yes. Related but off-topic: there is a small misalignment in calling setup_environment() between ddp and ddp spawn (#11073), but this can be easily resolved after #11643.

@four4fish

Yeah, makes sense! We can call accelerator.is_available() when we init_accelerator() in accel_connector. I will modify the rewrite PR after these get merged.


rohitgr7 commented Feb 9, 2022

if we move setup_environment up, will it be okay that self.device becomes available for users inside the prepare_data and configure_callbacks hooks?

@ananthsub

if we move setup_environment up, will it be okay that self.device becomes available for users inside the prepare_data and configure_callbacks hooks?

self.device would not be available because the module would not be moved to the device by then. setup_environment doesn't control this: model_to_device is what determines it.

But in general, the environment setup should be the first thing that happens. This minimizes control-flow differences between processes created externally vs. spawned by the Trainer, as well as between single-process and distributed training.


rohitgr7 commented Feb 9, 2022

oh! yeah sorry. I did a quick run... looks like it defaults to device('cpu')
