Lightweight BO in botorch #2273
-
Hi, first of all, great work! I am looking to run BoTorch in a lightweight setting with highly limited RAM and wonder if anyone has had similar thoughts. In principle, the baseline requirements to run BO are quite small, but I run into two hiccups.

The first is torch itself. Just importing torch and botorch takes 350MB, though I suppose botorch does not require all of this functionality? Has anyone found a reasonable way to run a lightweight version of torch with only the necessary functionality?

Secondly, I'm looking to evaluate the acquisition function at a large number of points as part of a custom acquisition optimizer. Calling `model.posterior()` naturally computes the full MVN distribution (which uses a lot of RAM), whereas I only need the mean and variance at every point. On the other hand, reshaping the points into a large number of batches instead makes it very slow (and, strangely, doesn't use that little memory according to memory_profiler). Is there an easy way to get just the mean and variance without computing the full posterior that I have missed? Thanks a lot!
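(To make the mean-and-variance point concrete: for an exact GP, the per-point posterior mean and marginal variance can be computed without ever materializing the full test-test covariance. Below is a plain-NumPy sketch of the underlying linear algebra, using an RBF kernel with unit outputscale; the helper name `gp_mean_and_var` and all hyperparameters are illustrative assumptions, not BoTorch API. In GPyTorch/BoTorch terms, covariances are evaluated lazily, so accessing only `posterior.mean` and `posterior.variance` may already avoid forming the full matrix.)

```python
import numpy as np

def gp_mean_and_var(X_train, y_train, X_test, lengthscale=1.0, noise=1e-4):
    """Exact GP posterior mean and *marginal* variance at X_test,
    without forming the full n_test x n_test posterior covariance.
    (Illustrative sketch; RBF kernel with unit outputscale assumed.)"""
    def rbf(A, B):
        # Squared-exponential kernel matrix between row sets A and B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf(X_test, X_train)            # shape (n_test, n_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star @ alpha
    # Only the diagonal of the posterior covariance:
    # var_i = k(x_i, x_i) - k_*^T K^{-1} k_*  (and rbf(x, x) == 1 here)
    v = np.linalg.solve(L, K_star.T)         # shape (n_train, n_test)
    var = 1.0 - (v**2).sum(axis=0)
    return mean, var
```

The memory high-water mark here is the `(n_train, n_test)` cross-covariance, not an `(n_test, n_test)` matrix, which is exactly the distinction the question is after.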
-
Hi @ErikOrm. This is an interesting setting that we haven't thought much about.

I briefly looked at this on my laptop. For me, the Python process itself seems to register 6-7MB before any imports. `import torch` brings this up to 142MB, and `import botorch` increases it further to 183MB. The precise numbers are not that relevant; there's clearly significant memory usage from just importing these packages. I do not know what gets loaded into memory on `import torch`. Presumably, this includes the core tensor functionality and the relevant C++ code that enables many of the tensor operations we rely on. BoTorch…

It is hard to say much about how the reshaping would affect the runtime & memory usage without some profiling. In general, the effects will depend on the input sizes. If you have relatively small training data and you're jointly predicting a small number of points (…

As you can see, the effect is dependent on the train and test data sizes. The largest tensors we work with will be the train-test covariances, which have shape …
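The kind of per-process measurement described above can be sketched with just the standard library (Linux assumed: `ru_maxrss` is reported in KiB there, but in bytes on macOS). A large `bytearray` stands in for a heavyweight import such as `import torch`, so the snippet stays dependency-free:

```python
import resource

def peak_rss_mib():
    # Peak resident set size of this process so far.
    # ru_maxrss is KiB on Linux (bytes on macOS; Linux assumed here).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mib()
# Stand-in for a heavy import: allocate and zero-fill ~50 MiB
# so the pages are actually touched and show up in RSS.
blob = bytearray(50 * 1024 * 1024)
after = peak_rss_mib()
print(f"peak RSS: {before:.0f} MiB -> {after:.0f} MiB")
```

For clean numbers per import, run each `import` in a fresh interpreter, since `ru_maxrss` is a high-water mark and never decreases.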