RF: How to deal with devices #1331

Closed
albertz opened this issue May 19, 2023 · 4 comments

albertz commented May 19, 2023

Currently we don't really deal with devices at all in the RETURNN frontend (RF) (#1120). The assumption is that we can figure out the placement automatically; usually we would perform most computations on the GPU if possible.

This issue here is just to keep track of this aspect. It's not really clear yet what we should do about it.

TensorFlow provides a context manager (tf.device) to control the placement of ops on a specific device. Otherwise it mostly tries to place as much as possible on the GPU automatically. The computation graph is usually completely independent of any device information, and TensorFlow automatically transfers tensors from one device to another as needed.
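
For illustration, a minimal sketch of these TF semantics (assuming TF 2.x with a visible GPU):

    import tensorflow as tf

    with tf.device("/CPU:0"):
        a = tf.random.uniform([2, 3])   # explicitly placed on CPU
    with tf.device("/GPU:0"):
        b = tf.random.uniform([2, 3])   # explicitly placed on GPU

    # Without any annotation, TF picks the device itself (preferring GPU)
    # and transparently copies tensors between devices as needed:
    c = a + b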

PyTorch has a device argument for most tensor-creating ops. When creating tensors, you have to choose explicitly where they are placed. Functions like to() (which can also be applied to a whole model) are then used to move tensors to a different device; this has to be done explicitly. PyTorch >=2.0 also has torch.set_default_device, and torch.device can now be used as a context manager.

Before that, PyTorch had more CUDA-specific logic, e.g. torch.cuda.device.
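
A minimal sketch of these PyTorch mechanisms (assuming PyTorch >=2.0 with CUDA available):

    import torch

    x = torch.zeros(2, 3, device="cuda")   # explicit device argument at creation
    y = torch.zeros(2, 3).to("cuda")        # explicit move; .to() also works on a whole nn.Module

    torch.set_default_device("cuda")        # global default for newly created tensors (>=2.0)
    z = torch.zeros(2, 3)                   # now lands on CUDA

    with torch.device("cpu"):               # torch.device as context manager (>=2.0)
        w = torch.zeros(2, 3)               # created on CPU inside the block

    with torch.cuda.device(0):              # older, CUDA-specific variant: selects among CUDA devices
        v = torch.zeros(2, 3, device="cuda")    # "cuda" now resolves to cuda:0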

PyTorch does not really have the concept of a generic GPU device: everything is CUDA-specific, so it is the "CUDA device", and other backends such as MPS are separate device types. This is different from TF, where it would just be the GPU device, no matter whether that is backed by CUDA, MPS or something else.

Instead of adding a device argument to all of our functions, we probably also want such a global setting, and maybe also a context manager. But it's not totally clear how easily we can do this in a way that works for all backends (e.g. TF and PyTorch).

The interaction with multi-GPU training (DDP etc.) is also relevant.

albertz commented Jul 17, 2023

I realize now that most of the current code was never actually tested on GPU. It turns out there are many problems here and there where one part is on GPU and another is on CPU, and then PyTorch complains.

Not always though. There are a number of PyTorch functions which would not complain. E.g.:

  • x * 2 works when x is on GPU and 2 is obviously on CPU. This is generally fine for binary ops when one operand is a scalar. We now added an optimization that we do not add broadcast dims in this case, as that would break this support (see the snippet after this list).
  • x[2] works when x is on GPU and the index 2 here is on CPU.
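
A small check of these cases (assuming CUDA is available):

    import torch

    x = torch.arange(6, device="cuda")

    y = x * 2                    # scalar operand: fine, no explicit transfer needed
    y2 = x * torch.tensor(2)     # 0-dim CPU tensor as operand: also accepted
    e = x[2]                     # plain int index on a CUDA tensor: fine

    # A full (non-scalar) CPU tensor is not accepted:
    # x * torch.arange(6)  # RuntimeError: Expected all tensors to be on the same device ...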

We also naturally get a device mix due to the sequence lengths, which we keep on CPU, as a number of ops require them there in any case (which exactly?). But when we want to create a mask from them, the mask is actually often required on GPU. E.g. for masked softmax, we have:

torch.where(mask, tensor.raw_tensor, -inf_value)

I think inf_value is fine to also be on CPU. However, mask here must be on the same device as the tensor.
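
To illustrate (assuming CUDA; the names here are just for the example, not the actual RF code, and a Python scalar as fill value requires a recent PyTorch):

    import torch

    logits = torch.randn(2, 5, device="cuda")
    seq_lens = torch.tensor([3, 5])                           # on CPU, as usual in RETURNN
    mask_cpu = torch.arange(5)[None, :] < seq_lens[:, None]   # mask ends up on CPU

    # torch.where(mask_cpu, logits, float("-inf"))   # fails: mask and logits on different devices
    masked = torch.where(mask_cpu.to(logits.device), logits, float("-inf"))  # explicit transfer works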

But at what point should the CPU->GPU transfer happen? Internally this uses range_over_dim. Should that already be on GPU?

  • Tensor.get_sequence_mask_broadcast: Basically all use cases would require it on GPU.
  • Tensor.get_sequence_mask_tensor: Same, required on GPU.
  • Dim.get_mask: Same, required on GPU.

Inside Dim.get_mask, we then have range_over_dim, reduce and compare on the dim and/or sequence lengths.
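
In plain PyTorch terms, this roughly amounts to the following (illustrative only, not the actual RETURNN code; the CPU->GPU transfer of the lengths is made explicit here via the device argument):

    import torch

    def get_sequence_mask(seq_lens: torch.Tensor, max_len: int, device) -> torch.Tensor:
        """[B] lengths (typically on CPU) -> [B, T] bool mask on `device`."""
        positions = torch.arange(max_len, device=device)             # the range_over_dim part
        return positions[None, :] < seq_lens.to(device)[:, None]     # explicit CPU->GPU copy of the lengths

    mask = get_sequence_mask(torch.tensor([3, 5]), max_len=5, device="cuda")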

albertz commented Jul 18, 2023

Currently I think a context manager might also make sense on the RF level, very much like torch.device.
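
A hypothetical sketch of what that could look like (rf_device and its behavior are an assumption here, not an existing RF API):

    import contextlib

    _default_device = "cpu"

    @contextlib.contextmanager
    def rf_device(device: str):
        """Temporarily set the default device for RF tensor creation (illustrative only)."""
        global _default_device
        prev, _default_device = _default_device, device
        try:
            yield
        finally:
            _default_device = prev

    # Usage sketch:
    # with rf_device("cuda"):
    #     x = rf.zeros([batch_dim, feature_dim])   # would then be created on CUDA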

albertz commented Jul 18, 2023

When PyTorch is the backend, we want to avoid any automatic GPU -> CPU copy, or GPU1 -> GPU2 copy. Maybe even CPU -> GPU copy. We should follow the PyTorch design that all such copies must be explicit. However, this is problematic for the sequence lengths, exactly for the reasons stated above.

We might want to keep two copies of the sequence lengths, one on CPU and one on GPU, maybe created dynamically on first use?
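
A hypothetical sketch of that idea (not the actual Dim implementation): keep the canonical copy on CPU and materialize the GPU copy lazily on first use.

    import torch

    class SeqLens:
        def __init__(self, lengths_cpu: torch.Tensor):
            self._cpu = lengths_cpu    # canonical copy stays on CPU
            self._cached = {}          # device -> cached copy

        def on(self, device) -> torch.Tensor:
            """Return the lengths on `device`, copying only once per device."""
            device = torch.device(device)
            if device.type == "cpu":
                return self._cpu
            if device not in self._cached:
                self._cached[device] = self._cpu.to(device)
            return self._cached[device]

    lens = SeqLens(torch.tensor([3, 5, 2]))
    gpu_lens = lens.on("cuda")         # first call copies, later calls reuse the cached copy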

Can RF actually support all the different frameworks well when the underlying frameworks have very different semantics regarding device handling? E.g. in TF, all of that is usually automatic; at graph creation time, there is usually not even any knowledge about the devices.

In RF, we can maybe follow one approach, e.g. the PyTorch approach which requires everything to be explicit, and then try to fit that to all the other frameworks. That could work.

albertz added a commit that referenced this issue Jul 18, 2023

albertz commented Oct 21, 2023

We now have some basic support for device logic, and so far, this seems to be enough. I'm closing this.
