
Use device_name instead of device index to support other device #3933

Merged: 4 commits into microsoft:master, Jul 14, 2023

Conversation

hipudding
Contributor

Creating a tensor with device=Integer will always choose CUDA as its device on the current PyTorch version (2.1); other devices should use device={device}:{index}.

Change get_accelerator().current_device() to
get_accelerator().current_device_name() to support other devices.
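
As a minimal sketch of the change (assuming the standard DeepSpeed accelerator interface and PyTorch):

import torch
from deepspeed.accelerator import get_accelerator

# Before: current_device() returns a bare integer index, which PyTorch
# interprets as a CUDA ordinal regardless of the active accelerator.
buf = torch.empty(10, device=get_accelerator().current_device())

# After: current_device_name() returns a qualified name such as 'cuda:1'
# or 'npu:1', so the tensor is placed on the intended backend.
buf = torch.empty(10, device=get_accelerator().current_device_name())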

@tjruwase
Contributor

@hipudding, can you please share more details on the failure this is supposed to fix? Each accelerator can implement current_device() as appropriate, so I don't see how this API results in cuda devices. Thanks!

@hipudding
Contributor Author

hipudding commented Jul 13, 2023

@hipudding, can you please share more details on the failure this is supposed to fix? Each accelerator can implement current_device() as appropriate, so I don't see how this API results in cuda devices. Thanks!

Yes. For example, I'm using NPU as the backend and want to create an empty tensor.

a_tensor = torch.empty(10, device=get_accelerator().current_device())

For every accelerator, current_device() returns the current index of the backend. Suppose we are using npu:1; current_device() will return the integer 1. Then the code above is equivalent to:

a_tensor = torch.empty(10, device=1)

But PyTorch will use CUDA as the backend if device is an integer.
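
This is easy to verify directly; a small illustrative check (the 'npu' device type assumes the torch_npu plugin is installed):

import torch

print(torch.device(1))        # device(type='cuda', index=1): an integer implies CUDA
print(torch.device('npu:1'))  # device(type='npu', index=1), with torch_npu installed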

For this code to work with every backend, it should specify the device name, so I changed current_device() to current_device_name(), which returns a device name together with its index.

a_tensor = torch.empty(10, device=get_accelerator().current_device_name())

which is equivalent to:

a_tensor = torch.empty(10, device='npu:1')

@tjruwase
Contributor

@hipudding, I see your point. I agree that this is quite an inconvenience of torch, but I was suggesting that rather than changing DeepSpeed code, you could follow the xpu_accelerator implementation. That is working without needing this PR.

@tjruwase
Contributor

On second thought, perhaps I should confirm what xpu_accelerator is actually doing. @delock, how do you avoid the problem solved by this PR? Thanks.

@delock
Collaborator

delock commented Jul 14, 2023

current_device_name() should be the right API to use. current_device() will return an index, which should only be used when a number is needed, e.g. to iterate over all device indices.

I see this code was introduced by ZeRO++ (#3784); it looks like a misuse of current_device(). We have not started to evaluate ZeRO++ yet, which is probably why xpu didn't encounter this issue.
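
A minimal sketch of the distinction (assuming the standard DeepSpeed accelerator interface):

import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()

# current_device() -> int: use where a numeric index is needed,
# e.g. when iterating over all visible devices.
for idx in range(acc.device_count()):
    print(f"checking device index {idx}")

# current_device_name() -> str, e.g. 'npu:1': use wherever a torch
# device specifier is expected.
x = torch.empty(10, device=acc.current_device_name())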

@tjruwase
Contributor

@delock, thanks for the explanation. That makes sense.

@tjruwase tjruwase added this pull request to the merge queue Jul 14, 2023
Merged via the queue into microsoft:master with commit 7528035 Jul 14, 2023