
Use device_name instead of device index to support other device #3933

Merged: 4 commits into microsoft:master, Jul 14, 2023

Conversation

hipudding
Contributor

Creating a tensor with device=Integer will always choose CUDA as its device on the current PyTorch version (2.1); other devices should use device={device}:{index}.

Change get_accelerator().current_device() to
get_accelerator().current_device_name() to support other devices.
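
As a minimal sketch of the change (assuming the standard DeepSpeed accelerator interface and PyTorch):

import torch
from deepspeed.accelerator import get_accelerator

# Before: current_device() returns a bare integer index, which PyTorch
# interprets as a CUDA ordinal regardless of the active accelerator.
buf = torch.empty(10, device=get_accelerator().current_device())

# After: current_device_name() returns a qualified name such as 'cuda:1'
# or 'npu:1', so the tensor is placed on the intended backend.
buf = torch.empty(10, device=get_accelerator().current_device_name())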

@tjruwase
Contributor

@hipudding, can you please share more details on the failure this is supposed to fix? Each accelerator can implement current_device() as appropriate, so I don't see how this API results in cuda devices. Thanks!

@hipudding
Contributor Author

hipudding commented Jul 13, 2023

@hipudding, can you please share more details on the failure this is supposed to fix? Each accelerator can implement current_device() as appropriate, so I don't see how this API results in cuda devices. Thanks!

Yes. For example, I'm using NPU as the backend and want to create an empty tensor.

a_tensor = torch.empty(10, device=get_accelerator().current_device())

For every accelerator, current_device() returns the current index of the backend. Suppose we are using npu:1; current_device() will return the integer 1. Then the code above is equivalent to:

a_tensor = torch.empty(10, device=1)

But PyTorch will use CUDA as the backend if device is an integer.
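
This is easy to verify directly; a small illustrative check (the 'npu' device type assumes the torch_npu plugin is installed):

import torch

print(torch.device(1))        # device(type='cuda', index=1): an integer implies CUDA
print(torch.device('npu:1'))  # device(type='npu', index=1), with torch_npu installed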

For this code to work with every backend, it should specify the device name, so I changed current_device() to current_device_name(), which returns a device name together with its index.

a_tensor = torch.empty(10, device=get_accelerator().current_device_name())

which is equivalent to:

a_tensor = torch.empty(10, device='npu:1')

@tjruwase
Contributor

@hipudding, I see your point. I agree that this is quite an inconvenience of torch, but I was suggesting that rather than changing DeepSpeed code, you could follow the xpu_accelerator implementation. That is working without needing this PR.

@tjruwase
Contributor

On second thought, perhaps I should confirm what xpu_accelerator is actually doing. @delock, how do you avoid the problem solved by this PR? Thanks.

@delock
Collaborator

delock commented Jul 14, 2023

current_device_name() should be the right API to use. current_device() will return an index, which should only be used when a number is needed, e.g. to iterate over all device indices.

I see this code was introduced by ZeRO++ (#3784); it looks like a misuse of current_device(). We have not started to evaluate ZeRO++ yet, which is probably why xpu didn't encounter this issue.
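
A minimal sketch of the distinction (assuming the standard DeepSpeed accelerator interface):

import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()

# current_device() -> int: use where a numeric index is needed,
# e.g. when iterating over all visible devices.
for idx in range(acc.device_count()):
    print(f"checking device index {idx}")

# current_device_name() -> str, e.g. 'npu:1': use wherever a torch
# device specifier is expected.
x = torch.empty(10, device=acc.current_device_name())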

@tjruwase
Contributor

@delock, thanks for the explanation. That makes sense.

@tjruwase tjruwase added this pull request to the merge queue Jul 14, 2023
Merged via the queue into microsoft:master with commit 7528035 Jul 14, 2023