CUDA Memory Overflow in Jacobian Computation #1058
We've been planning a feature to let users control the "vectorization" factor of the jacobian computation (#680). At one extreme, one can compute the jacobian row-by-row. At the other extreme, we can use vmap to turn the for-loop into a vectorized computation for more performance (at the cost of using more peak memory). So there is a performance <-> memory tradeoff here. Today, `functorch.jacrev` sits at the fully vectorized extreme, which is why it has the highest peak memory usage.
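As a rough illustration of that tradeoff, here is a minimal sketch (not functorch's internal implementation, and the helper name `jacobian_chunked` is made up for this example): `chunk_size=1` corresponds to the row-by-row extreme, while a very large `chunk_size` approaches the fully vectorized behavior of `jacrev` today.

```python
import torch
from functorch import vjp, vmap

def jacobian_chunked(f, x, chunk_size=1):
    """Reverse-mode Jacobian of f at x, computed chunk_size rows at a time.

    Sketch only: assumes f takes a single tensor argument.
    """
    y, vjp_fn = vjp(f, x)
    # One standard-basis cotangent per output element.
    eye = torch.eye(y.numel(), dtype=x.dtype, device=x.device)
    rows = []
    for basis in eye.split(chunk_size):
        # vmap vectorizes only `chunk_size` VJPs at once, bounding peak memory.
        rows.append(vmap(vjp_fn)(basis)[0])
    return torch.cat(rows).reshape(*y.shape, *x.shape)
```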
If the output size is much greater than the input size, then it's likely that `jacfwd` will work better here.
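For context, a small sketch of what that swap looks like; the function `f` below is a made-up stand-in with far more outputs than inputs, not the `ResidualFunctional.residual` from this issue.

```python
import torch
from functorch import jacfwd, jacrev

def f(x):
    # Toy function: 1001 inputs -> 10010 outputs (output >> input).
    return torch.sin(x).repeat(10)

x = torch.randn(1001, dtype=torch.float64)

J_rev = jacrev(f)(x)   # reverse mode: one VJP per output element
J_fwd = jacfwd(f)(x)   # forward mode: one JVP per input element
assert torch.allclose(J_rev, J_fwd)
```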
Yes, I have tried `jacfwd`. I understand that forward-mode autodiff is faster than reverse-mode if the input size is smaller than the output size. But is forward-mode also more memory efficient?
Yeah, forward-mode should also be more memory efficient here, since it does not need to save intermediates for a backward pass.
I have a general question about automatic differentiation. I have a code base that computes the Jacobian of the above function manually (derive the math expression of the Jacobian and type it into the code), and the manual differentiation does not have memory issues on a 24 GB GPU. Theoretically, does automatic differentiation have to cost more memory than manual differentiation when computing the Jacobian of a vector function? It looks like automatic differentiation needs to store all intermediate matrices and therefore might consume more memory.
It depends on what exactly the manual differentiation is. But yes, reverse-mode AD needs to store intermediates, and this will increase the memory usage.
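As a toy illustration of that point (this is an assumed example, not the `ResidualFunctional` from this issue): for `f(x) = tanh(A @ x)` the Jacobian has a closed form, so the manual version only materializes the result, while `jacrev` also has to keep the intermediates recorded during the forward pass.

```python
import torch
from functorch import jacrev

A = torch.randn(10000, 1001, dtype=torch.float64)
x = torch.randn(1001, dtype=torch.float64)

def f(x):
    return torch.tanh(A @ x)

# Manual Jacobian: d tanh(Ax)/dx = diag(1 - tanh(Ax)^2) @ A, built directly.
J_manual = (1 - torch.tanh(A @ x) ** 2).unsqueeze(1) * A

# Reverse-mode AD: the forward pass saves intermediates (here A @ x) so that
# the vmapped backward passes can be replayed, raising peak memory.
J_auto = jacrev(f)(x)
assert torch.allclose(J_manual, J_auto)
```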
Hi,
I implemented a Jacobian computation using functorch, but encountered a memory overflow issue.

The function that I want to differentiate is `ResidualFunctional.residual`. I'd like to compute the Jacobian of this function w.r.t. its first argument `inputs`. The output of `ResidualFunctional.residual` is a tensor of size (10000,) and `inputs` is a tensor of size (1001,). Thus, the Jacobian is 10000 by 1001, which takes about 74 MB in double precision. However, `functorch.jacrev` ran into a memory overflow error on a 24 GB GPU. The error message is shown below. I am wondering why functorch takes so much memory in reverse-mode autodiff, and whether there is a solution to this issue. Below is a working example that reproduces this issue.
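The author's actual reproduction script and the CUDA out-of-memory traceback are not included in this extract. Purely as a stand-in, the sketch below mirrors the shapes described above: a hypothetical `residual` mapping a (1001,) tensor to a (10000,) tensor, differentiated with `functorch.jacrev` on the GPU. It will not necessarily reproduce the out-of-memory error, since the real `ResidualFunctional` is built on GPyTorch and is more involved.

```python
import torch
from functorch import jacrev

device = "cuda"
W = torch.randn(10000, 1001, dtype=torch.float64, device=device)
target = torch.randn(10000, dtype=torch.float64, device=device)

def residual(inputs):
    # Hypothetical stand-in for ResidualFunctional.residual: (1001,) -> (10000,)
    return torch.tanh(W @ inputs) - target

inputs = torch.randn(1001, dtype=torch.float64, device=device)
# The result is only 10000 x 1001 (~74 MB in float64), but the vmapped
# reverse-mode pass can peak far above that.
jacobian = jacrev(residual)(inputs)
```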
CUDA 11.4
FuncTorch 1.13.0
PyTorch 1.13.0
GPyTorch 1.9.0
Thanks!