
GPU device placement issues #610

Merged · 2 commits · Feb 1, 2022

Conversation

LouisRouillard (Contributor)

I think most of the issues users encounter result from device-string processing and mismatches. I'm trying a bunch of error-prone combinations to see how those mismatches can be caught and corrected.

For now I have only tried very small changes.

2 (known) issues remain:

  • prior device mismatch
  • embedding net with custom density estimator mismatch

But neither item exposes a device attribute for now, so more invasive changes may be needed.
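As a purely illustrative sketch of the kind of device-string processing being discussed (the function name and accepted aliases are hypothetical, not sbi's actual API), normalization might look like:

```python
def process_device(device: str) -> str:
    """Normalize a user-supplied device string (illustrative only).

    Maps common aliases ("gpu", bare "cuda") to an explicit "cuda:0",
    leaves "cpu" and already-indexed "cuda:N" strings untouched, and
    rejects anything else early with a clear error.
    """
    device = device.strip().lower()
    if device in ("gpu", "cuda"):
        return "cuda:0"
    if device == "cpu" or device.startswith("cuda:"):
        return device
    raise ValueError(f"Unrecognized device string: '{device}'")
```

Normalizing once at the API boundary means every downstream comparison can be a plain string equality check.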

@LouisRouillard changed the title from "Small changes to device st processing functions" to "GPU device placement issues" on Jan 26, 2022
@janfb (Contributor) commented Jan 26, 2022

looks good, thanks!

Regarding the prior device: there is a function for checking the prior device, but it is not used currently:
https://github.com/mackelab/sbi/blob/bc4d43bf60ec714790ef8baa4b08cc6e5b821e45/sbi/utils/torchutils.py#L42-L51

We could use this function in sbi/inference/base.py to check the prior (if it is not None).

Regarding the embedding and density estimator: this happens outside of the SBI loop, no? E.g., the user would use get_nn_models function to build their density estimator with embedding net, right? Here one quick fix would be to check the embedding net for its device, and then
a) throw an error that it should be on the CPU for now, or
b) throw a warning and move it to CPU to then compose it with the density estimator on the CPU.

I would tend to b). And you?

@michaeldeistler @jan-matthis what do you think?
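A minimal sketch of option b), the "warn and correct" behavior. A stub class stands in for torch.nn.Module so the idea is self-contained; the real check would inspect the device of the module's parameters:

```python
import warnings


class StubNet:
    """Stand-in for an nn.Module; tracks its device as a string."""

    def __init__(self, device: str = "cpu"):
        self.device = device

    def to(self, device: str) -> "StubNet":
        self.device = device
        return self


def move_embedding_net_to_cpu(embedding_net: StubNet) -> StubNet:
    """If the embedding net is not on CPU, warn the user and move it
    there before composing it with the density estimator on CPU."""
    if embedding_net.device != "cpu":
        warnings.warn(
            f"Embedding net is on '{embedding_net.device}'; "
            "moving it to 'cpu' to match the density estimator."
        )
        embedding_net = embedding_net.to("cpu")
    return embedding_net
```

Option a) would simply replace the `warnings.warn(...)` plus move with a `raise ValueError(...)`.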

@codecov-commenter commented Jan 26, 2022

Codecov Report

Merging #610 (9807de0) into main (8657c55) will increase coverage by 0.27%.
The diff coverage is 80.00%.


@@            Coverage Diff             @@
##             main     #610      +/-   ##
==========================================
+ Coverage   68.44%   68.72%   +0.27%     
==========================================
  Files          67       67              
  Lines        4443     4476      +33     
==========================================
+ Hits         3041     3076      +35     
+ Misses       1402     1400       -2     
Flag       Coverage Δ
unittests  68.72% <80.00%> (+0.27%) ⬆️

(Flags with carried forward coverage won't be shown.)

Impacted Files                    Coverage Δ
sbi/utils/torchutils.py           63.97% <50.00%> (+3.67%) ⬆️
sbi/utils/user_input_checks.py    87.62% <73.33%> (-0.84%) ⬇️
sbi/inference/base.py             72.84% <100.00%> (+0.18%) ⬆️
sbi/neural_nets/classifier.py     100.00% <100.00%> (ø)
sbi/neural_nets/flow.py           87.20% <100.00%> (+1.13%) ⬆️
sbi/neural_nets/mdn.py            100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8657c55...9807de0.

@michaeldeistler (Contributor) commented Jan 26, 2022

I'm having trouble understanding which cases are going wrong.

Is the issue that get_posterior_nn() (etc) can not handle the embedding_net being on the GPU?

@janfb (Contributor) commented Jan 26, 2022

Yes, that's one of the problems. We designed sbi such that the user would just say device="cuda" and we take care of moving everything to the device. However, it happens that users have their priors, data, or embeddings on the device already, and then posterior_nn() would compose a Flow with mixed devices and things break.

@LouisRouillard (Contributor, Author)

Hey! Sorry, I had a bunch of stuff to work on; I'll work on the aforementioned issues some more tonight. @janfb, I personally like APIs that are not error-tolerant but indicate clearly what went wrong (so option a). But sbi is generally built around the "warn and correct" concept, so I'll try to implement b).

@LouisRouillard (Contributor, Author)

@michaeldeistler I can show you the cases that go wrong in screenshare like I did with @janfb at some point if you want?

@LouisRouillard (Contributor, Author)

I've written a couple of functions that check what seem to be the most common error cases. I'm not sure I have propagated those functions everywhere needed, though. For the embedding_net I managed to implement a "warn and correct" behavior, but not for the prior check, because I feel I would have needed to "wrap" a potentially custom prior in another prior that replicates the sampling and then moves the data accordingly. Tell me what you think?
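The "wrapping" idea mentioned above could look roughly like this. Stubs stand in for a torch tensor and a custom prior, so this is only a sketch of the pattern, not sbi code:

```python
class StubSample:
    """Stand-in for a tensor that records which device it lives on."""

    def __init__(self, device: str):
        self.device = device

    def to(self, device: str) -> "StubSample":
        return StubSample(device)


class GpuPrior:
    """Stand-in for a custom prior whose samples come back on the GPU."""

    def sample(self):
        return StubSample("cuda:0")


class DeviceWrappedPrior:
    """Wrap a potentially custom prior so that every sample is moved
    to the target device right after sampling, replicating the
    original prior's sampling interface."""

    def __init__(self, prior, device: str):
        self.prior = prior
        self.device = device

    def sample(self, *args, **kwargs):
        return self.prior.sample(*args, **kwargs).to(self.device)
```

The cost of this approach is exactly what the comment above hints at: the wrapper must mirror every method users might call on the prior (log_prob, support, ...), which is why it feels invasive.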

@LouisRouillard (Contributor, Author)

Arf, I realize I redefined a prior-checking function. I'll remove it.

@michaeldeistler (Contributor)

@LouisRouillard I think quickest for now would be to just do a single commit with the black formatting changes. Then I'll quickly have a look :)

@@ -372,6 +372,23 @@ def check_prior_support(prior):
)


def check_embedding_net_device(embedding_net: nn.Module, batch_y: torch.Tensor) -> None:
batch_y_device = batch_y.device
Contributor

what if the user passes data on GPU? Then this check would pass and the embedding would remain on the GPU and cause issues with the density estimator that is constructed on CPU, no?

If we assume that we always build the density estimator on the CPU, then we could just always move the embedding to the GPU, no? (and warn, or throw an error depending on whether we want to warn and correct or not. )

Contributor (Author)

I'm not sure this situation will arise, in the sense that the mismatch between prior and estimator is checked before that step? So before this is run we know that batch_y (generated from the prior) is on the same device as the estimator, and all that remains is to check that the embedder is too. I guess the question is: is there a use case (unknown to me) where the build functions are used outside of the classic "pipeline"?

@LouisRouillard (Contributor, Author) commented Jan 28, 2022

Ok, so I advanced on the tests. I think I have good coverage over the few functions I implemented. I expanded the integration test test_train_with_different_data_and_training_device and added a new integration test test_embedding_nets_integration_training_device that tackles the use case that was failing for me. It checks that everything runs even with wrong device assignments, and that the user is warned when data or modules are moved automatically.

@LouisRouillard (Contributor, Author)

I sadly have conflicts; I'll try to rebase onto main and pray it's not too bad ^^'

@michaeldeistler (Contributor)

Watch out for sbi/utils/plot.py -- the file got deleted along the way

@michaeldeistler (Contributor)

Great! Is this ready for review?

@michaeldeistler (Contributor)

Ah just saw discord...I'll review it now

@michaeldeistler (Contributor) left a comment

This is great! Thanks a lot for looking into this and understanding and fixing the issue. Good to go from my side!

@LouisRouillard (Contributor, Author)

All tests in tests/inference_on_device_test.py and tests/user_input_checks_test.py, including the slow and gpu ones, pass locally!

@janfb (Contributor) left a comment

great code, great tests, thanks a lot!
I added small comments on types and some refactoring. This is good to go in once they are addressed.

assert batch_x.device == batch_y.device, (
"Mismatch in fed data's device: "
f"batch_x has device '{batch_x.device}' whereas "
f"batch_x has device '{batch_x.device}'. Please "
Contributor

typo: y vs x

Contributor

given we have >5 of those asserts you could write a function that checks two tensors for their device and prints a (slightly more general) error message?
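Such a shared helper might look like the following sketch. It is duck-typed over a `.device` attribute so it runs without torch here; in sbi it would take torch.Tensors, and the names are illustrative:

```python
def check_device_match(a, b, name_a: str = "batch_x", name_b: str = "batch_y") -> None:
    """Raise with a descriptive message if two tensors live on different
    devices. Centralizing the check avoids repeating the assert (and its
    copy-paste typos) at every call site."""
    if a.device != b.device:
        raise AssertionError(
            f"Mismatch in fed data's device: {name_a} is on '{a.device}' "
            f"whereas {name_b} is on '{b.device}'. Please move both to "
            "the same device."
        )
```

Each of the >5 assert sites then collapses to a single `check_device_match(batch_x, batch_y)` call.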

Contributor (Author)

Will do !

device = "cpu"

return device
return device


def check_if_prior_on_device(device, prior: Optional[Any] = None):
if prior is not None:
Contributor

general suggestion: we could have a

if prior is None: 
    pass
else: 
    ...

to improve readability?

Contributor (Author)

Will change it thanks :)

@LouisRouillard (Contributor, Author) left a comment

Thanks @janfb for the comments!

assert batch_x.device == batch_y.device, (
"Mismatch in fed data's device: "
f"batch_x has device '{batch_x.device}' whereas "
f"batch_x has device '{batch_x.device}'. Please "
Contributor (Author)

Will do !

@janfb (Contributor) left a comment

Great, thanks a lot! Good to go in once CI is passing.

@janfb janfb merged commit d373ff6 into sbi-dev:main Feb 1, 2022
4 participants