
Fix random validation errors for good (and restore torch.compile for the validation pipeline at the same time) #1131

Merged 2 commits into bghira:main on Nov 10, 2024

Conversation

@mhirki (Contributor) commented Nov 9, 2024

I saw that the random validation errors were back after commit 48cfc09 removed some of my earlier fixes for them. This time I decided to get to the bottom of why the pipeline randomly ends up on the CPU.

The root cause turned out to be an early return in setup_pipeline, which means this line at the end of setup_pipeline is never executed:

        self.pipeline = self.pipeline.to(self.inference_device)

The torch.compile call just before it is also never reached.
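For illustration, here is a loose sketch of the shape of setup_pipeline being described; the class layout, attribute names and the hypothetical build_pipeline helper are assumptions, not the actual SimpleTuner code:

```python
import torch

class Validation:
    """Loose sketch only: names and structure are assumptions,
    not the actual SimpleTuner implementation."""

    def __init__(self, inference_device, validation_torch_compile=False):
        self.pipeline = None
        self.inference_device = inference_device
        self.validation_torch_compile = validation_torch_compile

    def setup_pipeline(self):
        if self.pipeline is not None:
            # The bug: bailing out here when a pipeline already exists means
            # the torch.compile call and the .to(...) at the end never run,
            # so the pipeline can be left on the CPU. The fix is to make
            # sure the device move (and compile) still happen on this path.
            return

        self.pipeline = self.build_pipeline()  # hypothetical constructor

        if self.validation_torch_compile:
            # compile the denoiser before generating validation images
            self.pipeline.transformer = torch.compile(self.pipeline.transformer)
        self.pipeline = self.pipeline.to(self.inference_device)
```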

I did notice that moving the entire pipeline to the accelerator device has an unwanted side-effect: when I was testing with SD3.5, it also moved text_encoder_2 and text_encoder_3 to the GPU. Setting text_encoder_2 and text_encoder_3 to None in extra_pipeline_kwargs prevents them from being loaded in the first place. I'm hoping other pipelines won't complain about text_encoder_2 and text_encoder_3 in their kwargs; I tried throwing text_encoder_4 in there and the SD3 pipeline didn't complain.
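In plain diffusers terms, the workaround looks roughly like the snippet below. The model ID and dtype are just examples, and SimpleTuner feeds these overrides through extra_pipeline_kwargs rather than calling from_pretrained directly like this; it also assumes prompt embeddings are supplied precomputed, so the dropped encoders are never needed at inference time:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Passing None for the extra text encoders keeps them from being loaded,
# so the later .to(device) has nothing of theirs to move onto the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",  # example model ID
    text_encoder_2=None,
    text_encoder_3=None,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")  # only the components that were actually loaded move
```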

I tried testing validation_torch_compile, but it seemed to be taking forever, so I eventually ended up killing it.

mhirki added 2 commits:

…dom validation errors with SD3.5 after commit 48cfc09 removed some earlier fixes. This also fixes torch.compile not getting called for the validation pipeline.

Calling self.pipeline.to(self.inference_device) appears to have an unwanted side-effect: it moves additional text encoders to the accelerator device. In the case of SD3.5, I saw text_encoder_2 and text_encoder_3 getting moved to the GPU. This caused my RTX 3090 to go OOM when trying to generate validation images during training. Explicitly setting text_encoder_2 and text_encoder_3 to None in extra_pipeline_kwargs fixes this issue.
@bghira (Owner) commented Nov 9, 2024

the reason this is done is that some of the diffusers pipelines require the text encoder to be there simply so that its dtype can be checked 😮‍💨

i could copy those pipelines in and fix them, but they'll need to be fully checked for compatibility first.

i don't want to move the whole pipeline to the GPU when doing validations, because the necessary components are already there - that's why the text encoders remained on the CPU. but it'd be better for them to not be loaded at all.
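For context, several diffusers pipelines read the text encoder's dtype while preparing prompt embeddings (roughly prompt_embeds.to(dtype=self.text_encoder.dtype)), which is why the attribute has to exist even when the encoder itself isn't used. A patched copy could fall back to another component's dtype instead; this is purely an illustrative sketch, not code from this PR or from diffusers:

```python
import torch

def resolve_prompt_dtype(pipeline, default=torch.float32):
    """Pick a dtype for prompt embeddings without assuming the text
    encoder was loaded. Illustrative only."""
    text_encoder = getattr(pipeline, "text_encoder", None)
    if text_encoder is not None:
        return text_encoder.dtype
    # fall back to the denoiser (transformer for SD3/Flux, unet otherwise)
    denoiser = getattr(pipeline, "transformer", None) or getattr(pipeline, "unet", None)
    if denoiser is not None:
        return denoiser.dtype
    return default
```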

@bghira merged commit 3293fa0 into bghira:main on Nov 10, 2024
1 check passed