RF: How to deal with devices #1331
Comments
I realize now that most of the current code was never actually tested for whether it works on GPU. It turns out there are many problems here and there where some part is on CPU and another part is on GPU, and then PyTorch complains. Not always, though: there are a number of PyTorch functions which would not complain. E.g.:
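For illustration, a minimal sketch of one such case (assuming a CUDA device is available): binary ops silently accept a 0-dim CPU tensor next to a CUDA tensor, while a non-scalar CPU tensor raises the usual device-mismatch error.

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(4, 8, device="cuda")

    # A 0-dim CPU tensor is treated like a Python scalar here,
    # so mixing it with a CUDA tensor does not raise an error.
    y = x + torch.tensor(1.0)
    print(y.device)  # cuda:0

    # A non-scalar CPU tensor is rejected with a device-mismatch error.
    try:
        _ = x + torch.ones(8)
    except RuntimeError as exc:
        print(exc)  # "Expected all tensors to be on the same device ..."
```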
We also naturally have a mixup due to the seq lens, which we have on CPU, as a number of ops require that in any case (which exactly?). But when we then want to create a mask, the mask is actually often required on GPU, e.g. for torch.where(mask, tensor.raw_tensor, -inf_value), I think. But at what point should we have the CPU->GPU transfer? Internally it uses ...
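As an illustration of that transfer point, a minimal sketch (the helper name `sequence_mask` is hypothetical, not RF's actual API): keep the seq lens on CPU and copy them to the tensor's device once, at the point where the mask is built.

```python
import torch

def sequence_mask(seq_lens: torch.Tensor, max_len: int, device) -> torch.Tensor:
    """Build a [batch, time] bool mask on `device` from CPU sequence lengths."""
    seq_lens_dev = seq_lens.to(device)  # explicit CPU -> GPU transfer (small tensor)
    time_idx = torch.arange(max_len, device=device)  # [time]
    return time_idx[None, :] < seq_lens_dev[:, None]  # [batch, time]

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2, 5, device=device)
seq_lens = torch.tensor([5, 3])  # stays on CPU
mask = sequence_mask(seq_lens, x.shape[1], x.device)
x_masked = torch.where(mask, x, torch.tensor(float("-inf"), device=device))
```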
Currently I think a context manager might make sense also on the RF level, very much like the ones discussed above (`tf.device` in TF, `torch.device` in PyTorch).
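A rough sketch of what such an RF-level scope could look like (purely hypothetical names, not the actual RF API): a thread-local default device plus a context manager that a backend would consult when creating raw tensors.

```python
import contextlib
import threading

# Hypothetical sketch of an RF-level default-device scope.
_state = threading.local()

def get_default_device() -> str:
    return getattr(_state, "device", "cpu")

@contextlib.contextmanager
def default_device(device: str):
    """Temporarily set the default device for newly created tensors."""
    prev = get_default_device()
    _state.device = device
    try:
        yield
    finally:
        _state.device = prev

# Usage sketch: a backend would read get_default_device() when creating raw tensors.
with default_device("cuda"):
    assert get_default_device() == "cuda"
```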
When PyTorch is the backend, we want to avoid any automatic GPU -> CPU copy, or GPU1 -> GPU2 copy, maybe even any CPU -> GPU copy. We should follow the PyTorch design that all such copies must be explicit. However, this is problematic for the sequence lengths, exactly for the reasons stated above. We might want to have two copies of the sequence lengths, one on CPU and one on GPU, maybe created dynamically on first use?

Can RF actually support all the different frameworks well when the underlying frameworks have very different semantics regarding device handling? E.g. in TF, all of that is usually automatic; at graph creation time, there is usually not even any knowledge about the devices. In RF, we can maybe follow one approach, e.g. the PyTorch approach which requires everything to be explicit, and then try to fit that to all the other frameworks. That could work.
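One possible shape for the "two copies, created on first use" idea, as a hypothetical sketch (not the actual RETURNN implementation): keep the canonical seq lens on CPU and lazily cache per-device copies.

```python
import torch

class SeqLens:
    """Hypothetical sketch: keep sequence lengths on CPU and cache per-device
    copies on first use, so CPU ops (e.g. loop bounds) and GPU ops (e.g. masks)
    both get them cheaply, with the copy happening exactly once per device."""

    def __init__(self, seq_lens_cpu: torch.Tensor):
        assert seq_lens_cpu.device.type == "cpu"
        self._copies = {torch.device("cpu"): seq_lens_cpu}

    def get(self, device) -> torch.Tensor:
        device = torch.device(device)
        if device not in self._copies:
            # Explicit copy, done once per device, then reused.
            self._copies[device] = self._copies[torch.device("cpu")].to(device)
        return self._copies[device]

seq_lens = SeqLens(torch.tensor([7, 3, 5]))
cpu_lens = seq_lens.get("cpu")
if torch.cuda.is_available():
    gpu_lens = seq_lens.get("cuda")  # copied once, cached afterwards
```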
We now have some basic support for device logic, and so far this seems to be enough. I'm closing this.
Currently we don't really deal with devices at all in the RETURNN frontend (RF) (#1120). This assumes that the placement can be figured out automatically. Usually we would perform most computations on the GPU if possible.
This issue here is just to keep track of this aspect. It's not really clear yet what we should do about it.
TensorFlow provides a context manager to control the placement of ops on a specific device. Otherwise, it mostly tries to automatically move as much as possible to the GPU. The computation graph is usually completely independent of any device information, and TensorFlow automatically transfers tensors from one device to another.
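For reference, the TF behavior described here, shown in eager mode: `tf.device` pins placement inside the scope, and outside of it TF places ops and transfers tensors automatically.

```python
import tensorflow as tf

# Pin placement explicitly inside the scope.
with tf.device("/CPU:0"):
    a = tf.constant([1.0, 2.0])

# Outside the scope, TF chooses the device automatically (GPU if available)
# and implicitly copies `a` to wherever the op runs.
b = a * 2.0
```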
PyTorch has a `device` argument for most ops. When creating tensors initially, you have to choose explicitly where they are placed. Then functions like `to` (which can also be applied to a whole model) are used to move tensors to a different device. This has to be done explicitly. PyTorch >=2.0 also has `torch.set_default_device`, and `torch.device` can now also be used as a context manager. Before that, PyTorch had more CUDA-specific logic, e.g. `torch.cuda.device`. See here.

PyTorch does not really have the concept of a GPU device; it is all CUDA-specific, so it is the "CUDA device", and then it also supports other backends, e.g. MPS. This is different from TF, where it would be the GPU device, no matter whether that is CUDA or MPS or something else.
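For reference, the PyTorch >=2.0 mechanisms mentioned above:

```python
import torch

# Global default device for newly created tensors (PyTorch >= 2.0).
torch.set_default_device("cpu")

# torch.device can now also be used as a context manager.
device = "cuda" if torch.cuda.is_available() else "cpu"
with torch.device(device):
    x = torch.zeros(2, 3)   # created on `device`
    y = torch.randn(2, 3)   # also on `device`
print(x.device, y.device)

# Moving a tensor afterwards still has to be explicit.
x_cpu = x.to("cpu")
```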
Instead of adding a `device` argument to all of our functions, we probably also want to have such a global setting, and maybe also a context manager. But it's not totally clear how easily we can do this in a way that works for all backends (e.g. TF and PyTorch).

The interaction with multi-GPU training (DDP etc.) is also relevant.
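For context, a minimal sketch of how explicit device placement interacts with DDP in plain PyTorch (not RF-specific; assumes a launch via torchrun with one GPU per process):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with e.g.: torchrun --nproc-per-node=2 this_script.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank)

model = torch.nn.Linear(8, 8).to(device)          # parameters on this rank's GPU
ddp_model = DDP(model, device_ids=[local_rank])   # one GPU per process

x = torch.randn(4, 8, device=device)              # inputs must live on the same device
loss = ddp_model(x).sum()
loss.backward()                                   # gradients are all-reduced across ranks

dist.destroy_process_group()
```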