Truncated proposals for SNPE (TSNPE) implementation #1354
-
Replies: 3 comments
-
Hi there, thanks for raising this! A quick question: what is …? Also, what is the dimensionality of theta and x? Michael
-
Hi @ali-akhavan89,

In general, when you sample from a posterior that uses a large embedding net or is conditioned on large data, the samples are generated through forward passes through the underlying embedding net and density estimator, so the custom embedding net is always used. If you draw 1,000,000 samples, they all accumulate on the GPU, which is the likely cause of the out-of-memory error. If you really need that many samples, a workaround is to draw them in a for-loop, e.g. 10 x 100,000 or 100 x 10,000 samples, and move each batch to the CPU inside the loop (see the sketch below).

Regarding the device-mismatch error after moving the posterior estimator to the CPU: you need to make sure that everything lives on the CPU then, i.e. the prior, the posterior object, the data, and the net. Ideally, you just create a new posterior object that lives entirely on the CPU.
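For illustration, a minimal sketch of the batched-sampling workaround. It assumes a trained `posterior` and an observation `x_o` already exist on the GPU; the total count and batch size are only examples.

```python
import torch

# Assumed to exist from the earlier rounds: `posterior` (trained sbi posterior)
# and `x_o` (the observation), both living on the GPU.
num_total = 1_000_000   # illustrative totals; adjust to what you actually need
batch_size = 100_000

samples_cpu = []
for _ in range(num_total // batch_size):
    # Each batch is generated on the GPU via forward passes through the
    # embedding net and density estimator ...
    batch = posterior.sample((batch_size,), x=x_o, show_progress_bars=False)
    # ... and is immediately moved to the CPU so GPU memory does not accumulate.
    samples_cpu.append(batch.cpu())

samples = torch.cat(samples_cpu, dim=0)  # shape: (num_total, theta_dim), on CPU
```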
Does this help? Cheers,
-
Thank you. We use licensed software (Vensim) that makes it challenging for us to share the simulator. I am in the process of developing a minimal example using only Python, which I hope to share soon. We are dealing with time-series data, which is why we rely on the GPU (mostly RNN networks that we borrowed from BayesFlow and revised based on the feedback we received from this community).

Regarding the comments: the problems sometimes have 12-24 dimensions and become more complex as the number of time points goes above 200. I have also been working on the CPU solution, but I have struggled a bit, which I shared in #1368. In low-dimensional problems, using 10,000 samples for the accept_reject_fn works fine. I'm not sure why the default value was set to 1,000,000, which confused me and kept me from testing smaller sample sizes. In high-dimensional problems (12 dimensions, 300 time points, and 3 outcome time series), however, a sample size of 10,000 already requires 44 GB of RAM on CUDA, which I'm still investigating.

Thanks again for the suggestions, and I'll keep you updated. Best,
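For reference, a hedged sketch of lowering the 1,000,000-sample default mentioned above in the TSNPE proposal-truncation step. The `num_samples_to_estimate_support` keyword of `get_density_thresholder` is an assumption about the current sbi API and may be named differently or absent in other versions; `posterior` and `prior` are assumed to exist from the previous round.

```python
from sbi.utils import get_density_thresholder, RestrictedPrior

# Build the accept/reject function with fewer support-estimation samples than the
# library default (assumed to be 1,000,000), to reduce peak GPU memory.
accept_reject_fn = get_density_thresholder(
    posterior,
    quantile=1e-4,
    num_samples_to_estimate_support=10_000,  # assumed keyword; check your sbi version
)

# Truncated proposal for the next TSNPE round.
proposal = RestrictedPrior(prior, accept_reject_fn, sample_with="rejection")
```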