
Re-initializing main() because the training of light MLP diverged and all the values are zero. #4

Closed
YONGHUICAI opened this issue Nov 29, 2023 · 14 comments

@YONGHUICAI

Thanks for open-sourcing such a great project!
When I trained on the yufeng and marcel datasets, this error quickly occurred: Re-initializing main() because the training of light MLP diverged and all the values are zero.
Code tested on an RTX 3090 Ti.
How can I solve this problem?

@sbharadwajj
Owner

Hi,

Did the code restart or crash? Typically it should restart.

@YONGHUICAI
Author

Hi,

Did the code restart or crash? Typically it should restart.

It keeps restarting and then runs into the same error again each time.
I would like to know what causes this error and how I can fix it.
Thanks.

@sbharadwajj
Owner

sbharadwajj commented Nov 30, 2023

Hi,

Unfortunately, the only solution is to restart the code. I have not had time to check whether this is a GPU-specific problem: on one of the GPUs I used it never occurred, and I only ran into it on other GPUs.

So please try restarting the code. If it still does not work, let me know!

The error happens because we do not use an activation function on the output layer of the light MLP; its values are left unconstrained since we tonemap them afterwards.
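To illustrate what the message refers to, here is a minimal sketch of that kind of setup and the guard that triggers the restart; the module and function names below are placeholders for illustration, not this repo's actual code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the light MLP: the final Linear layer has no
# activation, so its outputs stay unconstrained (they are tonemapped later).
light_mlp = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3),  # no output activation
)

def light_mlp_diverged(pred: torch.Tensor) -> bool:
    """Heuristic divergence check: the run is treated as dead when the
    predictions contain NaN/Inf or have collapsed to all zeros."""
    return (not torch.isfinite(pred).all()) or bool((pred == 0).all())

# Inside the training loop, a guard along these lines would trigger the restart:
#   if light_mlp_diverged(pred):
#       print("Re-initializing main() because the training of light MLP "
#             "diverged and all the values are zero.")
#       # ...re-initialize the model / re-enter main() (restart hook not shown)
```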

@yydxlv

yydxlv commented Dec 1, 2023

The same problem occurs on an A800.

@sbharadwajj
Owner

Hi,

Does the code never work, or does this happen only a few times?

@YONGHUICAI
Author

Hi,

Does the code never work, or does this happen only a few times?

Unfortunately, the code never works.

@sbharadwajj
Owner

sbharadwajj commented Dec 4, 2023

Hi,
Can you please run the code once again and copy-paste the exact log that is printed on the terminal?

@Orange-Ctrl

Orange-Ctrl commented Dec 6, 2023

Hi,
I ran the code on an RTX 3090 and got the same problem.
(screenshot of the terminal log: Snipaste_2023-12-06_16-56-40)

@Orange-Ctrl

I found the problem: the module robust_loss_pytorch was missing. It runs successfully now!

@YONGHUICAI
Author

Hi, I ran the code on an RTX 3090 and got the same problem.

Yeah, pip install git+https://github.com/jonbarron/robust_loss_pytorch
It works!
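If you want to double-check that the dependency is visible before starting a long run, a quick check like this helps (plain Python, nothing project-specific):

```python
# Quick sanity check that the dependency is importable before a long run.
try:
    import robust_loss_pytorch
    print("robust_loss_pytorch found at:", robust_loss_pytorch.__file__)
except ImportError:
    print("robust_loss_pytorch is still missing; re-run the pip install above.")
```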

@sbharadwajj
Owner

Glad to hear it works now :)

I think I forgot to include it in the requirements.txt.

@Orange-Ctrl I can also see a tinycudann installation warning in your log screenshot. You have compiled tinycudann on one GPU device and are running it on another. To get the best performance, please make sure it is installed properly on the machine you run on.
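A generic way to see which compute capability tinycudann should target on the machine you actually run on (this is plain PyTorch, not something from this repo):

```python
import torch

# Print the compute capability of each visible GPU; tinycudann should be
# (re)compiled for this architecture to avoid the warning above.
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    print(f"GPU {idx}: {torch.cuda.get_device_name(idx)} "
          f"(compute capability {major}.{minor})")
```

An RTX 3090 reports compute capability 8.6; rebuilding tinycudann on the machine it will run on is usually enough to clear the warning.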

@zydmu123

Sorry to bother you @sbharadwajj, but it really doesn't work even after installing robust_loss_pytorch as suggested above. My GPU is an RTX 3090, and it has never run successfully even once. Could you give me some help? Thanks!

@Yingyan-Xu

Hi @zydmu123, did you manage to run the code in the end? I'm having the same issue on an RTX 3090; I also tried installing robust_loss_pytorch and it didn't help.

@sbharadwajj
Owner

@Yingyan-Xu did you verify that the mask is correct? Can you quickly save the mask and check it?
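For example, a minimal helper to dump a mask tensor as a PNG for a visual check (the function, variable, and path names here are hypothetical, not names from this codebase):

```python
from torchvision.utils import save_image

# `mask` is assumed to be a float tensor in [0, 1] with shape (H, W) or (1, H, W).
def save_mask_for_inspection(mask, path="debug_mask.png"):
    if mask.dim() == 2:          # (H, W) -> add a channel dimension
        mask = mask.unsqueeze(0)
    save_image(mask.float(), path)
```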
