Tips for utilizing GPU with expected hypervolume improvement? #985

nathanohara · 2021-11-18T01:16:09Z

nathanohara
Nov 18, 2021

Hi everyone,

I am working on a BO loop for an objective function that has 5 inputs and 3 outputs. I am using a multi-output SingleTaskGP, qExpectedHypervolumeImprovement, and a SobolQMCNormalSampler with 128 MC samples.

My code runs fine when using objective functions with less than 3 outputs, and also runs when I use a small number of MC samples, such as 32. However, on the 3-output objective function (and other objectives with larger m), I am consistently getting a RuntimeError: CUDA out of memory during the computation of qEHVI. Here's an example of an error message I'm getting, prompted by /botorch/acquisition/multi_objective/monte_carlo.py", line 284, in _compute_qehvi:

RuntimeError: CUDA out of memory. Tried to allocate 4.20 GiB (GPU 0; 15.78 GiB total capacity; 6.60 GiB already allocated; 4.17 GiB free; 10.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Changing the max split size and even disabling CUDA memory caching in PyTorch do not change this. It seems computing qEHVI requires a very large amount of free memory, given the creation of tensors like overlap_vertices which have size mc_samples x batch_shape x num_cells x q_choose_i x m .

Does anyone have tips for avoiding memory issues when using qEHVI in botorch? Is there anything I can do to alleviate this outside of adding resources? This is my first time applying GPU parallelization in a botorch loop. Thanks for any help.

Some relevant questions:

Is there a reason we need to load all MC samples in at once? Would it be possible to calculate the acquisition function / its gradients on a few samples at a time, cycling them out of GPU memory and storing their gradients, then later averaging to compute the final gradient of the MC acquisition function with respect to the candidates x? Seems this can be done using the batch_limit option in optimize_acqf
Is the computational graph potentially being cloned for each sample?

edit: also noticing this issue and going to try the suggestions of the users there, though it's not clear to me whether this is the cause of my issues. In some cases, the CUDA error arises only after several iterations of BO.

Additional information:

botorch==0.5.1
gpytorch==1.5.1
torch==1.10
1 x NVIDIA Tesla V100, Debian 10, cloud VM

Answered by Balandat

Nov 18, 2021

Hi, thanks for the inquiry.

Exact computation of HVI in the computation of EHVI can indeed be very memory intensive, as the time complexity for computing the hypervolume indicator itself scales super-polynomially with the number of objectives (see our paper for some discussion).

However, we have found that by using approximate box decompositions (essentially choosing the alpha parameter in the NonDominatedPartitioning greater than zero) can speed things up & reduce memory footprint significantly. Have you tried this?

Another way of reducing memory complexity would be to reduce the number of MC samples.

View full answer

Balandat · 2021-11-18T02:34:34Z

Balandat
Nov 18, 2021
Collaborator

Hi, thanks for the inquiry.

Exact computation of HVI in the computation of EHVI can indeed be very memory intensive, as the time complexity for computing the hypervolume indicator itself scales super-polynomially with the number of objectives (see our paper for some discussion).

However, we have found that by using approximate box decompositions (essentially choosing the alpha parameter in the NonDominatedPartitioning greater than zero) can speed things up & reduce memory footprint significantly. Have you tried this?

Another way of reducing memory complexity would be to reduce the number of MC samples.

1 reply

nathanohara Nov 19, 2021
Author

Thank you so much for the quick reply. This combined with sdaulton's suggestion to reduce the batch limit has solved my problem and allowed me to scale to large m.

sdaulton · 2021-11-18T04:11:15Z

sdaulton
Nov 18, 2021
Collaborator

What batch size are you using? Is this q=1? For q > 3 or so, you will have much lower memory overhead using qNoisyExpectedHypervolumeImprovement.

Couple other comments:

Use a ModelListGP rather than a BatchedMultiOutputGPyTorchModel to avoid unnecessarily materializing the .... x nt x nt covariance matrix where n is the number of test points and t is the number of outputs (this issue is noted here: [Bug] SingleTaskGP.posterior with multioutput evaluates the covar matrix #971).
If you are using qEHVI with exact box decompositions, use FastNonDominatedPartitioning, which is far more efficient in terms of wall time and the number of rectangles (memory) than NonDominatedPartitioning. If you want to use approximate decompositions, then use NonDominatedPartitioning with alpha>0.
Pass the "batch_limit" (5 or lower) as an option to optimize_acqf

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tips for utilizing GPU with expected hypervolume improvement? #985

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Tips for utilizing GPU with expected hypervolume improvement? #985

nathanohara Nov 18, 2021

Replies: 2 comments · 1 reply

Balandat Nov 18, 2021 Collaborator

nathanohara Nov 19, 2021 Author

sdaulton Nov 18, 2021 Collaborator

nathanohara
Nov 18, 2021

Replies: 2 comments 1 reply

Balandat
Nov 18, 2021
Collaborator

nathanohara Nov 19, 2021
Author

sdaulton
Nov 18, 2021
Collaborator