
[RFC] Add Accelerator.is_available() interface requirement #11818

Closed
ananthsub opened this issue Feb 9, 2022 · 10 comments · Fixed by #11797
Labels: accelerator · breaking change · design · feature
Milestone: 1.6

ananthsub commented Feb 9, 2022

🚀 Feature

Add an is_available() static method requirement to the Accelerator interface so that each accelerator implementation can report whether its corresponding hardware is present.

Motivation

Such functionality on the Accelerator abstraction would:

  • Enable automatic hardware selection without duplicating code across Trainer & individual accelerator implementations.
  • Simplify the accelerator connector logic and rewrite effort: Rewrite accelerator_connector #11448
  • Enable automatic runtime checking of hardware availability during execution
  • Provide consistency with how the Trainer auto-detects the cluster environments natively supported by the framework. The corollary here is ClusterEnvironment.detect, sketched below the permalinks.

https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/plugins/environments/cluster_environment.py#L43-L46

https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/trainer/connectors/accelerator_connector.py#L810-L812
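
For reference, the linked ClusterEnvironment.detect hook and its use in the connector look roughly like the following. This is a paraphrased sketch: _detect_cluster_environment is an illustrative helper, and the exact set of environments the connector checks may differ from what is shown here.

from pytorch_lightning.plugins.environments import (
    ClusterEnvironment,
    KubeflowEnvironment,
    LightningEnvironment,
    LSFEnvironment,
    TorchElasticEnvironment,
)


# The abstract hook on ClusterEnvironment (paraphrased from the first permalink):
#
#     @staticmethod
#     @abstractmethod
#     def detect() -> bool:
#         """Detects the environment settings corresponding to this cluster and returns True if they match."""


def _detect_cluster_environment() -> ClusterEnvironment:
    """Illustrative helper mirroring the auto-selection in the second permalink."""
    for env_type in (TorchElasticEnvironment, KubeflowEnvironment, LSFEnvironment):
        if env_type.detect():
            return env_type()
    # No managed cluster detected: fall back to the default single-node environment.
    return LightningEnvironment()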

Pitch

from abc import ABC, abstractmethod

import torch


class Accelerator(ABC):

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Detect if the hardware is available."""

    def setup_environment(self, root_device: torch.device) -> None:
        """Setup any processes or distributed connections.

        This is called before the LightningModule/DataModule setup hook, which allows the user to access the
        accelerator environment before setup is complete.

        Raises:
            RuntimeError:
                If the corresponding hardware is not found.
        """
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")


class CPUAccelerator(Accelerator):

    @staticmethod
    def is_available() -> bool:
        """CPU is always available for execution."""
        return True


class GPUAccelerator(Accelerator):

    @staticmethod
    def is_available() -> bool:
        return torch.cuda.is_available() and torch.cuda.device_count() > 0

and so on for the other accelerators (TPU, IPU, etc.).

See #11797 for a more detailed implementation of what this looks like in practice.

To support Trainer(accelerator="auto"), this is what the logic simplifies to:

for acc_cls in (GPUAccelerator, TPUAccelerator, IPUAccelerator, CPUAccelerator):
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator() # fallback to CPU

This could be simplified even further if we offered an AcceleratorRegistry, so that the Trainer/AcceleratorConnector didn't need to hardcode the list of accelerators to detect (a minimal sketch follows the snippet below):

for acc_cls in AcceleratorRegistry.impls:
    if acc_cls.is_available():
        return acc_cls()
return CPUAccelerator() # fallback to CPU
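
A minimal sketch of what such a registry could look like, building on the Accelerator base class from the pitch above. AcceleratorRegistry and its register decorator are hypothetical here, not an existing Lightning API:

from typing import List, Type


class AcceleratorRegistry:
    """Hypothetical registry that accelerator implementations opt into at definition time."""

    # Registration order doubles as detection priority, with CPU registered last as the fallback.
    impls: List[Type[Accelerator]] = []

    @classmethod
    def register(cls, acc_cls: Type[Accelerator]) -> Type[Accelerator]:
        """Class decorator that adds an Accelerator subclass to the registry."""
        cls.impls.append(acc_cls)
        return acc_cls


# Each implementation then registers itself where it is defined, e.g.:
#
#     @AcceleratorRegistry.register
#     class GPUAccelerator(Accelerator):
#         ...

With this in place, the auto-selection loop above only needs to iterate AcceleratorRegistry.impls.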

Alternatives

Some other alternatives exist here:
#11799
#11798

Issues with these approaches:

  • They are also breaking changes: simply instantiating the accelerator could raise a runtime error if the device isn't available.
  • The bigger issue to me is that they do not ease support for Trainer(accelerator="auto"): the accelerator connector would still need to hardcode & re-implement each of the device checks to determine which Accelerator to even instantiate.

Additional context



cc @Borda @tchaton @justusschock @awaelchli @akihironitta @rohitgr7

ananthsub added the feature, design, breaking change, and accelerator labels on Feb 9, 2022
ananthsub added this to the 1.6 milestone on Feb 9, 2022
ananthsub linked a pull request (#11797) on Feb 9, 2022 that will close this issue

tchaton commented Feb 9, 2022

Looks great, I like this.


rohitgr7 commented Feb 9, 2022

Looks good!

One quick question: why is this check inside setup_environment here?

    def setup_environment(self, root_device: torch.device) -> None:
        """Setup any processes or distributed connections.
        This is called before the LightningModule/DataModule setup hook which allows the user to access the accelerator
        environment before setup is complete.
        Raises:
            RuntimeError:
                If corresponding hardware is not found.
        """
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")

and not inside init? I think we had a discussion regarding this in one of your PRs.


ananthsub commented Feb 9, 2022

and not inside init? I think we had a discussion regarding this in one of your PRs.

@rohitgr7 Great question. Ultimately, the accelerator is mimicking torch device semantics.

PyTorch users today can create a torch.device("cuda") on a host without GPUs available without any error being raised.
An error only appears when someone tries to do CUDA operations, like moving the tensor to the device.

d = torch.device("cuda")  # this is fine
torch.tensor(0.0).to(d)  # this raises an error

To me, this is very similar to the Accelerator init (device creation) and Accelerator.setup_environment (moving data to the device).

To preserve this style, I don't raise an exception in init, and instead do it at setup_environment.
It also makes instantiating, mocking, and testing the different devices a lot easier to do.
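
To illustrate that last point: with the check living in setup_environment rather than __init__, an accelerator can be instantiated on any machine and the availability probe can be mocked in tests, roughly like this. This is a sketch against the GPUAccelerator from the pitch above; the test name is illustrative.

from unittest import mock

import pytest
import torch


def test_gpu_accelerator_on_cpu_only_machine():
    acc = GPUAccelerator()  # instantiation alone never raises, mirroring torch.device("cuda")
    with mock.patch("torch.cuda.is_available", return_value=False):
        with pytest.raises(RuntimeError, match="not configured to run on this hardware"):
            acc.setup_environment(torch.device("cuda", 0))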


justusschock commented Feb 9, 2022

@ananthsub I see the reasons for that, but on the other hand I usually am in favor of raising errors as early as possible. And for testing we will probably have to mock the torch.cuda package anyways.

Not sure which one has more pros or cons, just wanted to mention this :) But I definitely like the idea in general :)

@ananthsub

@justusschock I agree, failing fast is very important.

To note:

I'd argue that Strategy.setup_environment should happen before prepare_data, as other existing misconfiguration checks currently take place there and could happen sooner: https://github.com/PyTorchLightning/pytorch-lightning/blob/b34d8673a9e77d25a826b7ada31d1204dfc818ab/pytorch_lightning/accelerators/cpu.py#L28-L36
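
For context, the misconfiguration check behind that permalink is roughly the following (paraphrased; the exact code and message may differ):

import torch

from pytorch_lightning.utilities.exceptions import MisconfigurationException


class CPUAccelerator(Accelerator):  # paraphrased from the linked file

    def setup_environment(self, root_device: torch.device) -> None:
        """
        Raises:
            MisconfigurationException:
                If the selected device is not CPU.
        """
        super().setup_environment(root_device)
        if root_device.type != "cpu":
            raise MisconfigurationException(f"Device should be CPU, got {root_device} instead.")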

@awaelchli

Makes sense

And separately, we could move setup_environment to happen even sooner inside of Trainer._run

Yes. Related but off-topic: there is a small misalignment in calling setup_environment() between ddp and ddp spawn (#11073), but this can be easily resolved after #11643.

@four4fish

Yeah, makes sense! We can call accelerator.is_available() when we init_accelerator() in accel_connector. I will modify the rewrite PR after these get merged.


rohitgr7 commented Feb 9, 2022

if we move setup_environment up, will it be okay that self.device becomes available for users inside the prepare_data and configure_callbacks hooks?

@ananthsub

if we move setup_environment up, will it be okay that self.device becomes available for users inside the prepare_data and configure_callbacks hooks?

self.device would not be available because the module would not be moved to the device by then. setup_environment doesn't control this: model_to_device is what determines it.

But in general, the environment setup should be the first thing that happens. This minimizes control-flow differences between processes created externally vs. spawned by the Trainer, as well as between single-process and distributed training.


rohitgr7 commented Feb 9, 2022

oh! yeah sorry. I did a quick run... looks like it defaults to device('cpu')
