
Fix random validation errors for good (and restore torch.compile for the validation pipeline at the same time) #1131

Merged 2 commits into bghira:main on Nov 10, 2024

Conversation

@mhirki (Contributor) commented Nov 9, 2024

I saw that the random validation errors were back after commit 48cfc09 removed some of my earlier fixes for them. This time I decided to get to the bottom of why the pipeline randomly ends up on the CPU.

The root cause turned out to be an early return in setup_pipeline, which means this line at the end of setup_pipeline is never executed:

        self.pipeline = self.pipeline.to(self.inference_device)

The torch.compile call just before it is also never reached.
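For illustration, here is a loose sketch of the shape of setup_pipeline being described; the class layout, attribute names and the hypothetical build_pipeline helper are assumptions, not the actual SimpleTuner code:

```python
import torch

class Validation:
    """Loose sketch only: names and structure are assumptions,
    not the actual SimpleTuner implementation."""

    def __init__(self, inference_device, validation_torch_compile=False):
        self.pipeline = None
        self.inference_device = inference_device
        self.validation_torch_compile = validation_torch_compile

    def setup_pipeline(self):
        if self.pipeline is not None:
            # The bug: bailing out here when a pipeline already exists means
            # the torch.compile call and the .to(...) at the end never run,
            # so the pipeline can be left on the CPU. The fix is to make
            # sure the device move (and compile) still happen on this path.
            return

        self.pipeline = self.build_pipeline()  # hypothetical constructor

        if self.validation_torch_compile:
            # compile the denoiser before generating validation images
            self.pipeline.transformer = torch.compile(self.pipeline.transformer)
        self.pipeline = self.pipeline.to(self.inference_device)
```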

I did notice that moving the entire pipeline to the accelerator device has an unwanted side-effect: when I was testing with SD3.5, it also moved text_encoder_2 and text_encoder_3 to the GPU. Setting text_encoder_2 and text_encoder_3 to None in extra_pipeline_kwargs prevents them from being loaded in the first place. I'm hoping other pipelines won't complain about text_encoder_2 and text_encoder_3 in their kwargs; I tried throwing text_encoder_4 in there and the SD3 pipeline didn't complain.
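In plain diffusers terms, the workaround looks roughly like the snippet below. The model ID and dtype are just examples, and SimpleTuner feeds these overrides through extra_pipeline_kwargs rather than calling from_pretrained directly like this; it also assumes prompt embeddings are supplied precomputed, so the dropped encoders are never needed at inference time:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Passing None for the extra text encoders keeps them from being loaded,
# so the later .to(device) has nothing of theirs to move onto the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",  # example model ID
    text_encoder_2=None,
    text_encoder_3=None,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")  # only the components that were actually loaded move
```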

I tried testing validation_torch_compile, but it seemed to be taking forever, so I eventually ended up killing it.

mhirki added 2 commits:

…dom validation errors with SD3.5 after commit 48cfc09 removed some earlier fixes. This also fixes torch.compile not getting called for the validation pipeline.

Calling self.pipeline.to(self.inference_device) appears to have an unwanted side-effect: it moves additional text encoders to the accelerator device. In the case of SD3.5, I saw text_encoder_2 and text_encoder_3 getting moved to the GPU. This caused my RTX 3090 to go OOM when trying to generate validation images during training. Explicitly setting text_encoder_2 and text_encoder_3 to None in extra_pipeline_kwargs fixes this issue.
@bghira (Owner) commented Nov 9, 2024

the reason this is done is that some of the diffusers pipelines require the text encoder to be there simply so that its dtype can be checked 😮‍💨

i could copy those pipelines in and fix them, but they'll need to be fully checked for compatibility first.

i don't want to move the whole pipeline to the GPU when doing validations, because the necessary components are already there - that's why the text encoders remained on the CPU. but it'd be better for them to not be loaded at all.
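For context, several diffusers pipelines read the text encoder's dtype while preparing prompt embeddings (roughly prompt_embeds.to(dtype=self.text_encoder.dtype)), which is why the attribute has to exist even when the encoder itself isn't used. A patched copy could fall back to another component's dtype instead; this is purely an illustrative sketch, not code from this PR or from diffusers:

```python
import torch

def resolve_prompt_dtype(pipeline, default=torch.float32):
    """Pick a dtype for prompt embeddings without assuming the text
    encoder was loaded. Illustrative only."""
    text_encoder = getattr(pipeline, "text_encoder", None)
    if text_encoder is not None:
        return text_encoder.dtype
    # fall back to the denoiser (transformer for SD3/Flux, unet otherwise)
    denoiser = getattr(pipeline, "transformer", None) or getattr(pipeline, "unet", None)
    if denoiser is not None:
        return denoiser.dtype
    return default
```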

@bghira merged commit 3293fa0 into bghira:main on Nov 10, 2024
1 check passed