[RFC] Add Accelerator.is_available() interface requirement #11818
Comments
Looks great, I like this.
looks good! one quick question: why inside setup_environment here

```python
def setup_environment(self, root_device: torch.device) -> None:
    """Setup any processes or distributed connections.

    This is called before the LightningModule/DataModule setup hook which allows the user to access the
    accelerator environment before setup is complete.

    Raises:
        RuntimeError:
            If corresponding hardware is not found.
    """
    if not self.is_available():
        raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")
```

and not inside __init__? I think we had a discussion regarding this in one of your PRs.
@rohitgr7 great question. Ultimately, the accelerator is mimicking torch device semantics. PyTorch users today can create a device and only hit an error when they move data to it:

```python
d = torch.device("cuda")  # this is fine
torch.tensor(0.0).to(d)   # this raises an error
```

To me, this is very similar to the Accelerator __init__ (device creation) and Accelerator.setup_environment (moving data to the device). To preserve this style, I don't raise an exception in __init__, and instead do it at setup_environment.
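To make the analogy concrete, here is a tiny self-contained sketch of those two-step semantics; the class is a toy stand-in for illustration, not the actual Lightning API. Creating the accelerator never fails, but setup_environment does on a machine without the hardware.

```python
import torch


class _ToyCUDAAccelerator:
    """Toy stand-in used only to illustrate the two-step semantics described above."""

    @staticmethod
    def is_available() -> bool:
        return torch.cuda.is_available()

    def setup_environment(self, root_device: torch.device) -> None:
        # Fail fast once we actually try to use the hardware.
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")


acc = _ToyCUDAAccelerator()  # creation always succeeds, like torch.device("cuda")
try:
    acc.setup_environment(torch.device("cuda", 0))  # fails here on a CPU-only machine, like .to(d)
except RuntimeError as err:
    print(err)
```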
@ananthsub I see the reasons for that, but on the other hand I'm usually in favor of raising errors as early as possible. And for testing, we will probably have to mock the torch.cuda package anyway. Not sure which one has more pros or cons, just wanted to mention this :) But I definitely like the idea in general :)
@justusschock I agree, failing fast is very important. To note:
I'd argue that …
Yeah, makes sense! We can call accelerator.is_available() when we init_accelerator() in the accel_connector. I will modify the rewrite PR after these get merged.
if we move …
self.device would not be available because the module would not be moved to the device by then; setup_environment doesn't control this. But in general, the environment setup should be the first thing that happens. This minimizes all control-flow changes between processes created externally vs. spawned by the trainer, as well as between single-process and distributed training.
oh! yeah, sorry. I did a quick run... looks like it defaults to …
🚀 Feature
Motivation
Such functionality on the Accelerator abstraction would:
ClusterEnvironment.detect
https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/plugins/environments/cluster_environment.py#L43-L46
https://github.com/PyTorchLightning/pytorch-lightning/blob/9e63281a4c4a62f32cad9801a23b63454f8311be/pytorch_lightning/trainer/connectors/accelerator_connector.py#L810-L812
Pitch
and so on
See a more detailed implementation in #11797 for what this looks like in practice.
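As a rough, self-contained sketch of the shape this interface requirement could take (the class bodies below are heavily simplified stand-ins, not the actual Lightning classes; #11797 is the authoritative proposal):

```python
from abc import ABC, abstractmethod

import torch


class Accelerator(ABC):
    """Simplified stand-in for the Lightning Accelerator base class."""

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Detect whether the corresponding hardware exists on this machine."""

    def setup_environment(self, root_device: torch.device) -> None:
        # Fail fast if the requested hardware is missing.
        if not self.is_available():
            raise RuntimeError(f"{self.__class__.__qualname__} is not configured to run on this hardware.")


class GPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return torch.cuda.device_count() > 0


class CPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return True
```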
To support Trainer(accelerator="auto"), this is what the logic simplifies to (see the sketch below). This could be even further simplified if we offered an AcceleratorRegistry, such that the Trainer/AcceleratorConnector didn't need to hardcode the list of accelerators to detect.
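A rough sketch of what that auto-selection might reduce to, reusing the simplified classes above (the candidate list and its ordering are assumptions for illustration, not the actual connector code):

```python
# With every accelerator exposing is_available(), "auto" selection becomes a simple scan.
# A real AcceleratorRegistry could be populated by plugins instead of hardcoded here.
_ACCELERATOR_CANDIDATES = [GPUAccelerator, CPUAccelerator]  # most capable first, CPU as fallback


def auto_select_accelerator() -> Accelerator:
    for accelerator_cls in _ACCELERATOR_CANDIDATES:
        if accelerator_cls.is_available():
            return accelerator_cls()
    raise RuntimeError("No available accelerator was found on this machine.")
```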
Alternatives
Some other alternatives exist here:
#11799
#11798
Issues with these approaches: they don't make it easier to support Trainer(accelerator="auto"). The accelerator connector needs to hardcode & re-implement each of the device checks to determine which Accelerator to even instantiate.

Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.
cc @Borda @tchaton @justusschock @awaelchli @akihironitta @rohitgr7