RF: How to deal with devices #1331

Closed
albertz opened this issue May 19, 2023 · 4 comments

albertz commented May 19, 2023

Currently we don't really deal with devices at all in the RETURNN frontend (RF) (#1120). The assumption is that we can figure out the placement automatically; usually we would perform most computations on the GPU if possible.

This issue here is just to keep track of this aspect. It's not really clear yet what we should do about it.

TensorFlow provides a context manager (tf.device) to control the placement of ops on a specific device. Otherwise it mostly tries to place as much as possible on the GPU automatically. The computation graph is usually completely independent of any device information, and TensorFlow automatically transfers tensors from one device to another as needed.
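
For illustration, a minimal sketch of these TF semantics (assuming TF 2.x with a visible GPU):

    import tensorflow as tf

    with tf.device("/CPU:0"):
        a = tf.random.uniform([2, 3])   # explicitly placed on CPU
    with tf.device("/GPU:0"):
        b = tf.random.uniform([2, 3])   # explicitly placed on GPU

    # Without any annotation, TF picks the device itself (preferring GPU)
    # and transparently copies tensors between devices as needed:
    c = a + b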

PyTorch has a device argument for most tensor-creating ops. When creating tensors, you have to choose explicitly where they are placed. Functions like to() (which can also be applied to a whole model) are then used to move tensors to a different device; this has to be done explicitly. PyTorch >=2.0 also has torch.set_default_device, and torch.device can now be used as a context manager.

Before that, PyTorch had more CUDA-specific logic, e.g. torch.cuda.device.
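
A minimal sketch of these PyTorch mechanisms (assuming PyTorch >=2.0 with CUDA available):

    import torch

    x = torch.zeros(2, 3, device="cuda")   # explicit device argument at creation
    y = torch.zeros(2, 3).to("cuda")        # explicit move; .to() also works on a whole nn.Module

    torch.set_default_device("cuda")        # global default for newly created tensors (>=2.0)
    z = torch.zeros(2, 3)                   # now lands on CUDA

    with torch.device("cpu"):               # torch.device as context manager (>=2.0)
        w = torch.zeros(2, 3)               # created on CPU inside the block

    with torch.cuda.device(0):              # older, CUDA-specific variant: selects among CUDA devices
        v = torch.zeros(2, 3, device="cuda")    # "cuda" now resolves to cuda:0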

PyTorch does not really have the concept of a generic GPU device: everything is CUDA-specific, so it is the "CUDA device", and other backends such as MPS are separate device types. This is different from TF, where it would just be the GPU device, no matter whether that is backed by CUDA, MPS or something else.

Instead of adding a device argument to all of our functions, we probably also want such a global setting, and maybe also a context manager. But it's not totally clear how easily we can do this in a way that works for all backends (e.g. TF and PyTorch).

The interaction with multi-GPU training (DDP etc.) is also relevant.

albertz commented Jul 17, 2023

I realize now that most of the current code was never actually tested on GPU. It turns out there are many problems here and there where one part is on GPU and another is on CPU, and then PyTorch complains.

Not always though. There are a number of PyTorch functions which would not complain. E.g.:

  • x * 2 works when x is on GPU and 2 is obviously on CPU. This is generally fine for binary ops when one operand is a scalar. We now added an optimization that we do not add broadcast dims in this case, as that would break this support (see the snippet after this list).
  • x[2] works when x is on GPU and the index 2 here is on CPU.
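
A small check of these cases (assuming CUDA is available):

    import torch

    x = torch.arange(6, device="cuda")

    y = x * 2                    # scalar operand: fine, no explicit transfer needed
    y2 = x * torch.tensor(2)     # 0-dim CPU tensor as operand: also accepted
    e = x[2]                     # plain int index on a CUDA tensor: fine

    # A full (non-scalar) CPU tensor is not accepted:
    # x * torch.arange(6)  # RuntimeError: Expected all tensors to be on the same device ...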

We also naturally get a device mix due to the sequence lengths, which we keep on CPU, as a number of ops require them there in any case (which exactly?). But when we want to create a mask from them, the mask is actually often required on GPU. E.g. for masked softmax, we have:

torch.where(mask, tensor.raw_tensor, -inf_value)

I think inf_value is fine to also be on CPU. However, mask here must be on the same device as the tensor.
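
To illustrate (assuming CUDA; the names here are just for the example, not the actual RF code, and a Python scalar as fill value requires a recent PyTorch):

    import torch

    logits = torch.randn(2, 5, device="cuda")
    seq_lens = torch.tensor([3, 5])                           # on CPU, as usual in RETURNN
    mask_cpu = torch.arange(5)[None, :] < seq_lens[:, None]   # mask ends up on CPU

    # torch.where(mask_cpu, logits, float("-inf"))   # fails: mask and logits on different devices
    masked = torch.where(mask_cpu.to(logits.device), logits, float("-inf"))  # explicit transfer works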

But at what point should the CPU->GPU transfer happen? Internally this uses range_over_dim. Should that already be on GPU?

  • Tensor.get_sequence_mask_broadcast: Basically all use cases would require it on GPU.
  • Tensor.get_sequence_mask_tensor: Same, required on GPU.
  • Dim.get_mask: Same, required on GPU.

Inside Dim.get_mask, we then have range_over_dim, reduce and compare on the dim and/or sequence lengths.
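
In plain PyTorch terms, this roughly amounts to the following (illustrative only, not the actual RETURNN code; the CPU->GPU transfer of the lengths is made explicit here via the device argument):

    import torch

    def get_sequence_mask(seq_lens: torch.Tensor, max_len: int, device) -> torch.Tensor:
        """[B] lengths (typically on CPU) -> [B, T] bool mask on `device`."""
        positions = torch.arange(max_len, device=device)             # the range_over_dim part
        return positions[None, :] < seq_lens.to(device)[:, None]     # explicit CPU->GPU copy of the lengths

    mask = get_sequence_mask(torch.tensor([3, 5]), max_len=5, device="cuda")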

albertz commented Jul 18, 2023

Currently I think a context manager might also make sense on the RF level, very much like torch.device.
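
A hypothetical sketch of what that could look like (rf_device and its behavior are an assumption here, not an existing RF API):

    import contextlib

    _default_device = "cpu"

    @contextlib.contextmanager
    def rf_device(device: str):
        """Temporarily set the default device for RF tensor creation (illustrative only)."""
        global _default_device
        prev, _default_device = _default_device, device
        try:
            yield
        finally:
            _default_device = prev

    # Usage sketch:
    # with rf_device("cuda"):
    #     x = rf.zeros([batch_dim, feature_dim])   # would then be created on CUDA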

albertz commented Jul 18, 2023

When PyTorch is the backend, we want to avoid any automatic GPU -> CPU copy, or GPU1 -> GPU2 copy. Maybe even CPU -> GPU copy. We should follow the PyTorch design that all such copies must be explicit. However, this is problematic for the sequence lengths, exactly for the reasons stated above.

We might want to keep two copies of the sequence lengths, one on CPU and one on GPU, maybe created dynamically on first use?
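
A hypothetical sketch of that idea (not the actual Dim implementation): keep the canonical copy on CPU and materialize the GPU copy lazily on first use.

    import torch

    class SeqLens:
        def __init__(self, lengths_cpu: torch.Tensor):
            self._cpu = lengths_cpu    # canonical copy stays on CPU
            self._cached = {}          # device -> cached copy

        def on(self, device) -> torch.Tensor:
            """Return the lengths on `device`, copying only once per device."""
            device = torch.device(device)
            if device.type == "cpu":
                return self._cpu
            if device not in self._cached:
                self._cached[device] = self._cpu.to(device)
            return self._cached[device]

    lens = SeqLens(torch.tensor([3, 5, 2]))
    gpu_lens = lens.on("cuda")         # first call copies, later calls reuse the cached copy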

Can RF actually support all the different frameworks well when the underlying frameworks have very different semantics regarding device handling? E.g. in TF, all of that is usually automatic; at graph creation time, there is usually not even any knowledge about the devices.

In RF, we can maybe follow one approach, e.g. the PyTorch approach which requires everything to be explicit, and then try to fit that to all the other frameworks. That could work.

albertz added a commit that referenced this issue Jul 18, 2023

albertz commented Oct 21, 2023

We now have some basic support for device logic, and so far, this seems to be enough. I'm closing this.
