Improving Consistency Across Optimization Loops #2110
-
Thanks for this thorough study, this is quite interesting. Since this is a relatively low-dimensional search space, my gut feeling is that acquisition function optimization is somewhat unlikely to be the main contributor here, so I think you're right that the next step would be to look into how much variability there is in the model fitting. One thing to note is that you're using

cc @esantorella and @saitcakmak, who have looked into reproducibility of the optimization before.
-
In playing around with BoTorch, I've found a concerning amount of variability between optimization runs that start from the same initial set of data points. The results of my test problem suggest that the random seed has substantial influence over the success of the optimization run. I'm still quite new to this, and it is likely that there is an implementation error on my end, but I wanted to reach out to the community and see if there is something I am missing or some way that I can improve. I've included as much information as possible below to hopefully expose any error in my methods.
TLDR: When variation between optimization loops is limited to random seed selection, is variation in optimizer performance then a function of poor surrogate model fitting, poor acqf optimization, or a little of both?
Starting with a simple Hartmann6 optimization
I've implemented a simple ensemble optimization loop in the code section below:
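A minimal sketch of this kind of loop is below (shown with a `SingleTaskGP` surrogate, `qLogExpectedImprovement`, and `optimize_acqf`; this is an illustration of the setup rather than the exact script that produced the numbers that follow):

```python
# Minimal sketch of the loop described above (illustrative, not the exact script
# behind the reported numbers). All loops share the same initial data; only the
# seed passed to run_loop differs.
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qLogExpectedImprovement
from botorch.optim import optimize_acqf
from botorch.test_functions import Hartmann
from gpytorch.mlls import ExactMarginalLogLikelihood

hartmann = Hartmann(dim=6, negate=True)  # negated, so the loop maximizes
bounds = torch.stack([torch.zeros(6), torch.ones(6)]).double()

# Shared initial design drawn once, reused by every loop.
torch.manual_seed(0)
init_x = torch.rand(10, 6, dtype=torch.double)
init_y = hartmann(init_x).unsqueeze(-1)

def run_loop(seed: int, n_iter: int = 20) -> float:
    """Run one BO loop from the shared initial data; only the seed varies."""
    torch.manual_seed(seed)
    train_x, train_y = init_x.clone(), init_y.clone()
    for _ in range(n_iter):
        model = SingleTaskGP(train_x, train_y)
        mll = ExactMarginalLogLikelihood(model.likelihood, model)
        fit_gpytorch_mll(mll)
        acqf = qLogExpectedImprovement(model, best_f=train_y.max())
        candidate, _ = optimize_acqf(
            acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=256,
        )
        train_x = torch.cat([train_x, candidate])
        train_y = torch.cat([train_y, hartmann(candidate).unsqueeze(-1)])
    return train_y.max().item()

best_per_seed = [run_loop(seed) for seed in range(10)]
print(best_per_seed)
```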
Running this script, I get the following results, which show a difference of 0.657 between the maximum and minimum optimized values over ten trials, or ~20% of the range of the objective function.
I would expect that increasing `num_restarts` and `raw_samples` would improve this, but the effect seems marginal (0.657 vs. 0.624), as shown in the figure below. I will note that in playing around I have found some combinations that perform better than others at times, but I have been unable to reproduce those results for this post.

Categorical Hartmann Problem Accentuates Discrepancies
The discrepancy between repeat trials also seems to scale with problem complexity. Below I have built a categorical Hartmann3/6 problem wherein each category value encodes an offset Hartmann3/6 objective function. Running with a mixed GP and a mixed acqf optimizer shows a disparity in performance between optimization loops.
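For concreteness, here is a rough sketch of the kind of construction I mean, using `MixedSingleTaskGP` and `optimize_acqf_mixed` (the category count, offsets, and integer encoding are illustrative, and only the Hartmann6 variant is shown):

```python
# Illustrative sketch of the categorical setup (number of categories, offsets,
# and integer encoding are made up for this example). The last input dimension
# is an integer-coded category selecting a per-category offset added to the
# Hartmann6 objective.
import torch
from botorch.models import MixedSingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qLogExpectedImprovement
from botorch.optim import optimize_acqf_mixed
from botorch.test_functions import Hartmann
from gpytorch.mlls import ExactMarginalLogLikelihood

hartmann = Hartmann(dim=6, negate=True)
offsets = torch.tensor([0.0, 0.25, 0.5, 0.75], dtype=torch.double)  # one per category

def objective(x: torch.Tensor) -> torch.Tensor:
    cat = x[..., -1].round().long()            # integer-coded categorical column
    return hartmann(x[..., :-1]) + offsets[cat]

# 2 x 7 bounds: six continuous dims in [0, 1] plus the categorical code in [0, 3].
bounds = torch.cat(
    [torch.stack([torch.zeros(6), torch.ones(6)]), torch.tensor([[0.0], [3.0]])],
    dim=-1,
).double()

torch.manual_seed(0)
train_x = torch.rand(12, 7, dtype=torch.double) * (bounds[1] - bounds[0]) + bounds[0]
train_x[:, -1] = train_x[:, -1].round()
train_y = objective(train_x).unsqueeze(-1)

model = MixedSingleTaskGP(train_x, train_y, cat_dims=[6])
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
acqf = qLogExpectedImprovement(model, best_f=train_y.max())
candidate, _ = optimize_acqf_mixed(
    acq_function=acqf,
    bounds=bounds,
    q=1,
    num_restarts=10,
    raw_samples=256,
    fixed_features_list=[{6: float(c)} for c in range(4)],  # enumerate categories
)
```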
The results show a large spread in "optimized" values despite similar initial conditions. Loop 4 is a clear outlier, but even setting it aside, the results suggest quite a bit of uncertainty in the performance of the optimization loop.
Visualizing these results further highlights the discrepancy:
Given that all optimization loops start with the same data and vary only by the random seed given to the optimizer, where does the variation come from, and how can it be controlled or reduced? Part of me suspects that this is due to variations in surrogate model fitting within `fit_gpytorch_mll`, but I haven't explored this thoroughly yet.
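A simple way to start probing that suspicion would be to refit the same model on the same data under different seeds and compare what comes out; a minimal sketch (assuming a plain `SingleTaskGP` rather than my exact model):

```python
# Sketch of a way to probe fit variability: refit on the same fixed data under
# different seeds and compare the learned hyperparameters and the achieved
# marginal log likelihood.
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood

def fit_summary(train_x: torch.Tensor, train_y: torch.Tensor, seed: int) -> dict:
    torch.manual_seed(seed)  # only matters if fitting hits randomized retries/restarts
    model = SingleTaskGP(train_x, train_y)
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_mll(mll)
    # Evaluate the achieved MLL on the model's own (possibly transformed) training data.
    model.train()
    with torch.no_grad():
        final_mll = mll(model(*model.train_inputs), model.train_targets).item()
    covar = model.covar_module
    kernel = getattr(covar, "base_kernel", covar)  # handle ScaleKernel-wrapped or plain kernels
    return {
        "seed": seed,
        "mll": final_mll,
        "lengthscales": kernel.lengthscale.detach().squeeze().tolist(),
        "noise": model.likelihood.noise.item(),
    }

# e.g. summaries = [fit_summary(train_x, train_y, seed) for seed in range(10)]
```

A large spread in these summaries would point toward the model fitting, while a tight spread would shift suspicion back to the acquisition function optimization.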
Looking specifically at sample point selections, it seems that deviation starts almost immediately for most variables. Below I have plotted the selected variable values across each trial and optimization loop; the top-most plot shows the categorical variable selection, and the continuous variables follow.

Hopefully I've communicated this effectively. Happy to clarify where necessary!