Training stuck at Epoch 15 #23

Brunettow opened this issue Jan 10, 2024 · 2 comments
Comments

@Brunettow

Hello,
I can't train the model because the process is killed after this message:

Building the data loader. Curriculum = 3/8, length = 32218.
Epoch 15 acc/qa=1.000000 loss=0.046158 loss/qa=0.046158 time/data=0.008719 time/step=1.016501: 100%|##############################| 1006/1006 [18:08<00:00, 1.08s/it]
Epoch 15 (validation) validation/acc/qa=1.000000: 2%|#4 | 20/1094 [00:41<11:04, 1.62it/s]/home/colors/Desktop/nscl/Jacinle/bin/jac-crun: line 6: 3305 Killed $JACROOT/bin/jac-run "$@"

@Brunettow Brunettow changed the title Stuck at Epoch 15 Training stuck at Epoch 15 Jan 10, 2024
@vacancy
Owner

vacancy commented Jan 10, 2024

This seems to be a memory issue, or someone else killed your process. Is it possible to run on a machine with more memory?
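
One way to test the memory hypothesis before switching machines is to log the process's peak memory while it runs and see whether it keeps climbing during validation. The snippet below is only a minimal sketch using Python's standard `resource` module; `peak_rss_mib` and `log_peak_rss` are hypothetical helper names, not part of this project or Jacinle, and where to call them in the training or validation loop is an assumption left to the reader. On Linux, the kernel log usually also records when the out-of-memory killer terminates a process, which would confirm the cause.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_rss(step: int, every: int = 10) -> None:
    """Print peak memory every `every` steps (hypothetical helper)."""
    if step % every == 0:
        print(f"step {step}: peak RSS = {peak_rss_mib():.1f} MiB")


if __name__ == "__main__":
    held = []
    for step in range(30):
        # Stand-in for per-batch allocations that are never released.
        held.append(bytearray(1024 * 1024))
        log_peak_rss(step)
```

If the reported peak grows steadily until the process is killed, that points to memory exhaustion rather than an external kill.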

@Brunettow
Author

Brunettow commented Jan 14, 2024

First of all, I tried training again and the process was killed at the same point.
Yes, it's possible to try a different machine. However, is there anything else I can try before doing that? Setting up the environment for the model was really hard because of the dependencies. It would be great if a Docker image of the project were available.
Thank you for your answer.
