[RFC] Future of `gpus/ipus/tpu_cores` with respect to `devices` #10410

SeanNaren · 2021-11-08T15:39:28Z

Proposed refactoring or deprecation

Currently we have two methods to specifying devices. Let's take GPUs for example:

The standard case that we've all grown used to and are mostly aware of.

trainer = Trainer(gpus=2)

Introduced in 1.5, tries to make the number of devices agnostic. This means if you specify accelerator='tpu' we automatically know to use 2 TPU cores.

trainer = Trainer(devices=2, accelerator='gpu')

Recently, it has come up in #10404 (comment) that we may want to deprecate and prevent further device specific names from appearing in the Trainer (such as hpus).

Related conversation #9053 (comment)

I see two options:

🚀 We keep both device specific arguments (gpus tpu_cores ipus for the Trainer) and devices
👀 We drop gpus tpu_cores ipus in the future and fully rely on devices. (Potentially this would likely be done in Lightning 2.0, instead of after 2 minor releases)

cc @kaushikb11 @justusschock @ananthsub @awaelchli

The text was updated successfully, but these errors were encountered:

ananthsub · 2021-11-08T20:08:55Z

IMO we should follow the contributing guidelines: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CONTRIBUTING.md#main-core-value-one-less-thing-to-remember

Having multiple options in the public API to do the same thing is really confusing.

I'm in favor of devices=X, accelerator=Y since it's clearer how extensible this can be.

four4fish · 2021-11-08T20:11:52Z

+1, totally agree

Current device related flags are confusing. Multiple flags partially overlap and interfere each other. When multiple flags passed in, we define prioritize and ignore some of the flags.

For example:
gpu=2, device=3 device will be ignored.
gpu=2, cpu=2, accelerator=cpu what will happen? I think cpu with num_process=2?

I prefer option 2, drop gpus, tpu_cores, ipus in the future and fully rely on devices
And can we have devices be int, not set =auto?

With this option: accelerator flag for device_type, devices (probably rename to devices_num?) for device_number. It's also scalable for new device types like hpus

williamFalcon · 2021-11-09T03:56:54Z

I think going from this:

Trainer(gpus=2)

to

Trainer(devices=2, accelerator='gpu')

is a major step backwards in usability. now users have to dig into docs to understand how to use things. it definitely violates the "one-less-thing-to-remember" part of the API.

I guess, I'm just wondering why we're exploring this? I thought we were already pretty stable on the device API stuff

justusschock · 2021-11-09T08:23:32Z

@williamFalcon The more kinds of accelerators we get, the more flags we will also have. Switching from Trainer(gpus=8) to Trainer(tpu_cores=8) also requires users to dig through the docs. Actually I find it easier to have Trainer(devices=2, accelerator='gpu'/'tpu') as the flags stay the same, it is easier to remember and also scaling better. So personally this would be the "one-less-thing-to-remember" for me.

Also I suspect, we would likely have the accelerator defaulting to 'auto' then which means that Trainer(devices=8) would run on gpu if available, on tpu if available and if no special accelerator is available it would fall back to cpu.

tchaton · 2021-11-09T11:13:57Z

@williamFalcon As @justusschock shared, the previous approach doesn't scale well and makes discoverability harder.

Furthermore, the new API provides an auto as follows:

Trainer(devices="auto", accelerator="auto")

which would make the code runnable on every hardware without any code changes. Which isn't possible with the previous API.

And we could even support num_nodes discovery too.

Trainer(devices="auto", accelerator="auto", num_nodes="auto")

SeanNaren · 2021-11-11T19:19:30Z

To address the discoverability issue, isn't it common to import the Trainer and see what parameters are available? Isn't this more common that going to the docs to find the parameter?

I opened the issue as I felt it was important as a community we come to an agreement as the idea was floating around a few PRs (with inconsistent agreement). It's important to have one single direction here (especially as we introduce other accelerators). I do strongly disagree with removing gpus/tpu_cores/ipus/hpus/cpus from the Trainer for primarily ease/discoverability.

I think it would be beneficial to try get community votes on this, so maybe a post on our General slack channel is warranted?

ananthsub · 2021-11-12T07:12:18Z

Even something like gpus as lightning defines them today is ambiguous. PyTorch also supports AMD GPUs: https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/

But this isn't supported at all with Lightning because when specifying gpus on the Trainer constructor, the Trainer assumes NVidia & CUDA are being used. On the other hand, PyTorch's device allows for more backends.

In my head the accelerator in Lightning maps to the torch device being used. By using the same sematics PyTorch offers, Lightning can keep parity more easily and smoothen the transition for users coming from vanilla PyTorch

tchaton · 2021-11-15T11:10:38Z

Adding to the conversation and @ananthsub comment. In this issue, a user is requesting to adding MIG for A100-like a machine: #10529.

This is another example of arguments like gpus, tpu-cores can grow out of control, and the need for a single device / accelerator arguments.

daniellepintz · 2021-11-16T18:55:02Z

@tchaton do we have a consensus to move forward with this issue?

zippeurfou · 2021-11-17T14:12:41Z

Maybe thought that is a bit different there.
Going back to @williamFalcon argument that:
Trainer(gpus=2) to Trainer(devices=2, accelerator='gpu') is more work for the user.
My question here is that as far as I understand a correct me if I am wrong you have CPU and then either gpu/tpu/hpu/...
That is why you can have something as follow:
Trainer(devices=2, accelerator='gpu') where accelerator is only one combination.
I also assume that our users "don't" want to care if it is gpu/tpu/hpu.. as their code will remain the same as long as it is not CPU and even if it is CPU PL help you there to make it seamless.
Finally, we can automatically detect what kind of accelerator you have available today with auto.
That being said what if we had a "wrapper" around anything that is non CPU such that we can keep the same structure while making it "easy" for the users.
ie. Trainer(cpus=2,xpus=2) this will automatically find if x is gpu/tpu/hpu.
Then we allow default to be auto ie. Trainer(cpus=null,xpu=null) or could use -1 for example.

t-vi · 2021-11-18T09:31:08Z

To take the pro-Accelerator argument to the extreme (also with the "fractional" devices), how about not splitting devices= and accelerator=?

If instantiating Accelerator all the time is too much of a hassle for @williamFalcon 's taste (I never liked the configuration part of tf sessions, either, and there is a good reason why PyTorch doesn't force you to do device = torch.Device("cuda") all over the thing but will just take "cuda"), how about:

Trainer(devices=2)   # I want two of whatever is available (so GPUs > CPUs in preference, but only the same kind.

Occasions where "casual users" will have TPU GPU and IPU in the same box will be rare enough...
This is breaking because it would make "GPU if available" the default :( (though I never understood why it is not).

For more elaborate configs, one could have

Trainer(devices=Accelerator("cuda", 2))

My apologies for adding another color of shed, but to my mind, there are these cases we want to cater to:

The easy one! Needing to instantiate Accelerator is a bit more API for people to remember than just gpus=.... Personally, I have to concentrate really hard to know how many c and l to put in there, too.
The turbopropower-user: Would it not be more consistent and flexible to have Accelerator as the single truth about what their thing trains on? I certainly like to consider all my clusters of 512 DGXes for training BERT in 30 seconds a single device...
The unknown future. I think we'll see a lot more blur to "thing the training runs on = n devices of type a" that the proposed API of devices=2, accelerator=... suggests.

Best regards

Thomas

tchaton · 2021-11-18T10:00:07Z

To add to @t-vi comment,

I believe the accelerator could be set to 'auto' by default as it is quite unlikely there is an overlapping machine with both GPUs and TPUs available.

So the hardware is totally abstracted and this provides an experience closer to Jax with their auto platform detection

Trainer(gpus=8) or Trainer(tpu_cores=8)  or Trainer(cpu=8)  or Trainer(hpus=8)  or Trainer(ipus=8) ...

would be replaced directly with:

Trainer(devices=8)

If a user has a machine with a GPU and wants to debug on CPU, he would simply add the accelerator to force the decision-making.

Trainer(devices=2, accelerator="cpu")

By the most critical point is:

I think we'll see a lot more blur to "thing the training runs on = n devices of type a" that the proposed API of devices=2, accelerator=... suggests.

I believe this API would need to provide a smarter hardware detection mechanism for MIG hardware.

dlangerm · 2021-11-18T17:06:06Z

Coming from both the High-Performance and embedded spaces, I'll weigh in here with some general thoughts.

Often with large clusters, we have models and/or datasets which can't fit on a single node. If the API says gpus=2 or tpus=2, what control do I have over where those devices are or which devices get used for which parts of the model? Should PTL support this type of deployment at all?
There are certain accelerators I might want to use which are only useful for inference but not training. FPGAs, for example, are really great for low-latency inference, but with the above API, do I have to instantiate a "Trainer" to use a device for inference? This makes little sense to me as a user. Is this something that PTL wants to support? If so, a rework is in order.
There is research being done on heterogeneous architectures which have GPUs, DSPs, etc. available on a single node. The distribution of work and communication between these devices is non-trivial and a scheduling nightmare, but it's not too far off. Virtualized communication technology like CXL and composable infrastructure like Liquid will enable these types of nodes to "exist" in a cloud or on-prem cluster. I think PTL should be forward thinking and have these types of setups in mind, especially if it is to be adopted by the research community as a usable tool.
A "device" as we think of it today (a GPU, a CPU) will likely be upended when in-memory processors come to the mainstream. (What is a "memory" device? How many "cores" does it have? It quickly loses any meaning). What about the Xilinx Versal architecture? It has many compute cores in a dynamic software-defined network fabric connected to an FPGA. It's one "device", but it's also many.

To the above points, I have a couple of suggestions:

The trainer should be agnostic to what it is executing on. It should be the object facilitating and orchestrating the training session (it is a trainer after all), but it shouldn't care what device is on the other end. If it does have knowledge of device-specifics, then as many of the users above pointed out, the API and argument count/complexity will explode if even just a few accelerators become mainstream and anything but basic training strategies are to be supported.
We should have an Accelerator API describing a device, its location, and its features. The average user shouldn't have to use this API at all, or even know it exists, and sane defaults should be set. However, it should be flexible enough to be used for cutting-edge device research. Where this accelerator API fits in the ecosystem is going to need to be decided by the community, but it shouldn't be passed to the trainer because if I have a device which is only for inference acceleration, then it makes no sense to create a trainer.

I am very interested to see where this discussion goes, and I apologize for the ramble.

carmocca · 2021-11-18T17:22:23Z

This discussion has extended to other related points, but to give my opinion the original question, I fully agree with @tchaton's API vision here: #10410 (comment).

Where the original gpus, tpus, ... are deprecated and removed.

I don't think adding new options xpus=..., or devices=Accelerator("cuda", 2) should be in the cards anymore, as the new devices=2, format was just introduced in 1.5 and we would be once again deprecating this newly introduced functionality for a different thing. There's no clear winner here and we just need to choose one approach.

do I have to instantiate a "Trainer" to use a device for inference

it shouldn't be passed to the trainer because if I have a device which is only for inference acceleration, then it makes no sense to create a trainer.

Keep in mind that the Trainer has that name since it's been the core part of Lightning since the beginning, but it's way more than a "trainer" and could be thought of as an engine, for example, we have validate, test, and predict which are split from the training procedure

williamFalcon · 2021-12-03T11:52:39Z

A lot of great inputs! Let me start off by summarizing:

The current API was built when only GPUs were supported. Then TPUs were added. And now a few years later, we live in a world where more alternatives are starting to emerge. This is the current API.

Trainer(gpus=2)
Trainer(tpus=2)

But now, we live in a world where more than GPU|TPU devices are coming out (HPU, etc...). In this case, the proposal is to modify the API like so:

Trainer(devices=2, accelerator='tpu')

Well... we also introduced the 'auto' flag, so the actual default call would look like this:

Trainer(devices=2)

# because Trainer(devices=2, accelerator='auto') is the default

@t-vi also brought up the alternative that there could be a class in the event that configs get unwieldy

Trainer(accelerator=Accelerator("cuda", 2))

@dlangerm also brought up that in certain complex scenarios:

Multinode training (which we've supported from day 1 @dlangerm, and you specify the num_nodes argument). Today we already support selecting many configurations here, so i'm not sure what a relevant usecase is.
Yes, PL already supports inference. We can think about configurations during inference needing a "Trainer" (@tchaton @carmocca maybe "Trainer" needs to be renamed in 2.0)
Heterogenous hardware is an awesome upcoming research challenge that we'll be excited to tackle next year. But today, it's premature until research matures a bit more.
In-memory processors also sound promising. If you know of a real use case, happy to collaborate on working out how to do something like that.
@dlangerm we do have an accelerator API (it's been there since 1.0)... it's just used internally and not exposed to the user.

Decision

So, with all that said, if there's broader community support for moving from:

Trainer(gpus=2)

to:

Trainer(devices=2, accelerator='tpu')

# default is auto
Trainer(devices=2)

Then I'm happy to back this option as it is more scalable and my only concern is "having to only remember one thing"...
So, I'd love to hear more from the community about the effect on usability.

if there are no major quelms about this and everyone's excited, let's roll it out for 2.0
cc @tchaton @carmocca @ananthsub @daniellepintz

daniellepintz · 2021-12-08T01:34:43Z

I have a question about this; if we want to roll this out for 2.0 when can we start working on it? Could we start working on it now for example?

four4fish · 2021-12-08T01:40:36Z

@tchaton @awaelchli @ananthsub What's you guys' thought on when will be the right time for 2.0? Should Accelerator Refactor and stable accelerator be part of the 2.0?
I think it's better to have big changes at once. I will prefer having accelerator stable version and the flags changes addressed in the same version. It's easier to communicate with users and reduce future confusings.

tchaton · 2021-12-08T09:30:54Z

Hey @daniellepintz @four4fish.

Yes, I agree with you both. I don't believe this change requires a Lightning 2.0 as this is a natural evolution of Lightning becoming hardware-agnostic directly at the Trainer level.

IMO, I would like to action this change for v1.6. @ananthsub @awaelchli @carmocca @kaushikb11 @justusschock Any thoughts on this ? If we are all positive to make this change, I will mark this issue as let's do it.

kaushikb11 · 2021-12-08T09:46:06Z

Agreed. We could go ahead with this change for v1.6, along with major Accelerator refactor.

dlangerm · 2021-12-08T14:37:52Z

Given the above decisions, is there a consensus on renaming Trainer to something more appropriate for the "Brain" or "Engine" that is has become?

If these changes are towards a hardware-agnostic API that can be used for either training or inference, Trainer will become very confusing. Even today, creating a Trainer instance to perform Trainer.predict is fairly unintuitive.

justusschock · 2021-12-08T15:00:27Z

@dlangerm this is not really related to this issue, so I won't go into much detail here. Feel free to open a new issue for this discussion.

From my POV, we shouldn't rename the Trainer before 2.0.
Renaming flags is one thing (and we will need to have a pretty long deprecation cycle for those), but renaming the major components as the Trainer or LightningModule would be too much of a breaking change since this could also break the API in all other places as well.

daniellepintz · 2021-12-08T17:26:02Z

Hey @kaushikb11 I saw you assigned yourself to this issue. I was planning on working on the accelerator_connector refactor (#10422) which was blocked by this issue. Am I okay to proceed with accelerator_connector refactor or is that something you were planning on doing?

daniellepintz · 2021-12-14T01:47:45Z

I am working on this in #11040 - do we also want to deprecate num_processes and num_nodes?

justusschock · 2021-12-14T08:46:49Z

I think num_processes yes, but num_nodes we might still need in case of multi-node training

daniellepintz · 2021-12-14T19:00:40Z

Got it, thanks!

SeanNaren added the refactor label Nov 8, 2021

kaushikb11 changed the title ~~[RFC] Future of gpus/ipus/tpus with respect to devices~~ [RFC] Future of gpus/ipus/tpu_cores with respect to devices Nov 8, 2021

kaushikb11 added design Includes a design discussion deprecation Includes a deprecation labels Nov 8, 2021

four4fish mentioned this issue Nov 8, 2021

[RFC] Simplifying the Accelerator Connector logic and flags #10422

Closed

kaushikb11 mentioned this issue Nov 17, 2021

Support passing Accelerator objects to the accelerator flag with devices=x #10592

Closed

kaushikb11 mentioned this issue Nov 18, 2021

[RFC] Future for accelerator and devices default values #10606

Closed

kaushikb11 self-assigned this Dec 8, 2021

four4fish mentioned this issue Dec 9, 2021

1/n Generalize internal checks for Accelerator in Trainer - remove trainer._device_type #11001

Closed

12 tasks

kaushikb11 assigned daniellepintz Dec 10, 2021

daniellepintz mentioned this issue Dec 11, 2021

Deprecate num_processes,gpus, tpu_cores, and ipus from the Trainer constructor #11040

Merged

12 tasks

carmocca mentioned this issue Jan 5, 2022

Provide a Default Parameter for Fit's Checkpoint Restore Path #10573

Closed

four4fish mentioned this issue Feb 8, 2022

Rewrite accelerator_connector #11448

Merged

12 tasks

carmocca added this to Lightning RFCs Feb 28, 2022

carmocca moved this to Accepted in Lightning RFCs Feb 28, 2022

carmocca added this to the 1.7 milestone Feb 28, 2022

awaelchli mentioned this issue Mar 12, 2022

Update tests/models/*.py to use devices instead of gpus or ipus #11470

Merged

12 tasks

DuYicong515 mentioned this issue Mar 24, 2022

Deprecate Trainer.gpus #12436

Merged

12 tasks

awaelchli closed this as completed in #11040 Apr 10, 2022

ananthsub mentioned this issue May 19, 2022

MPS (Mac M1) device support #13102

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Future of `gpus/ipus/tpu_cores` with respect to `devices` #10410

[RFC] Future of `gpus/ipus/tpu_cores` with respect to `devices` #10410

SeanNaren commented Nov 8, 2021 •

edited by ananthsub

Loading

ananthsub commented Nov 8, 2021 •

edited

Loading

four4fish commented Nov 8, 2021 •

edited

Loading

williamFalcon commented Nov 9, 2021 •

edited

Loading

justusschock commented Nov 9, 2021

tchaton commented Nov 9, 2021 •

edited

Loading

SeanNaren commented Nov 11, 2021

ananthsub commented Nov 12, 2021 •

edited

Loading

tchaton commented Nov 15, 2021 •

edited

Loading

daniellepintz commented Nov 16, 2021

zippeurfou commented Nov 17, 2021

t-vi commented Nov 18, 2021

tchaton commented Nov 18, 2021 •

edited

Loading

dlangerm commented Nov 18, 2021

carmocca commented Nov 18, 2021 •

edited

Loading

williamFalcon commented Dec 3, 2021 •

edited

Loading

daniellepintz commented Dec 8, 2021

four4fish commented Dec 8, 2021 •

edited

Loading

tchaton commented Dec 8, 2021 •

edited

Loading

kaushikb11 commented Dec 8, 2021

dlangerm commented Dec 8, 2021

justusschock commented Dec 8, 2021

daniellepintz commented Dec 8, 2021

daniellepintz commented Dec 14, 2021

justusschock commented Dec 14, 2021

daniellepintz commented Dec 14, 2021

[RFC] Future of gpus/ipus/tpu_cores with respect to devices #10410

[RFC] Future of gpus/ipus/tpu_cores with respect to devices #10410

Comments

SeanNaren commented Nov 8, 2021 • edited by ananthsub Loading

Proposed refactoring or deprecation

ananthsub commented Nov 8, 2021 • edited Loading

four4fish commented Nov 8, 2021 • edited Loading

williamFalcon commented Nov 9, 2021 • edited Loading

justusschock commented Nov 9, 2021

tchaton commented Nov 9, 2021 • edited Loading

SeanNaren commented Nov 11, 2021

ananthsub commented Nov 12, 2021 • edited Loading

tchaton commented Nov 15, 2021 • edited Loading

daniellepintz commented Nov 16, 2021

zippeurfou commented Nov 17, 2021

t-vi commented Nov 18, 2021

tchaton commented Nov 18, 2021 • edited Loading

dlangerm commented Nov 18, 2021

carmocca commented Nov 18, 2021 • edited Loading

williamFalcon commented Dec 3, 2021 • edited Loading

Decision

daniellepintz commented Dec 8, 2021

four4fish commented Dec 8, 2021 • edited Loading

tchaton commented Dec 8, 2021 • edited Loading

kaushikb11 commented Dec 8, 2021

dlangerm commented Dec 8, 2021

justusschock commented Dec 8, 2021

daniellepintz commented Dec 8, 2021

daniellepintz commented Dec 14, 2021

justusschock commented Dec 14, 2021

daniellepintz commented Dec 14, 2021

[RFC] Future of `gpus/ipus/tpu_cores` with respect to `devices` #10410

[RFC] Future of `gpus/ipus/tpu_cores` with respect to `devices` #10410

SeanNaren commented Nov 8, 2021 •

edited by ananthsub

Loading

ananthsub commented Nov 8, 2021 •

edited

Loading

four4fish commented Nov 8, 2021 •

edited

Loading

williamFalcon commented Nov 9, 2021 •

edited

Loading

tchaton commented Nov 9, 2021 •

edited

Loading

ananthsub commented Nov 12, 2021 •

edited

Loading

tchaton commented Nov 15, 2021 •

edited

Loading

tchaton commented Nov 18, 2021 •

edited

Loading

carmocca commented Nov 18, 2021 •

edited

Loading

williamFalcon commented Dec 3, 2021 •

edited

Loading

four4fish commented Dec 8, 2021 •

edited

Loading

tchaton commented Dec 8, 2021 •

edited

Loading