[BUG] Error when restart model training #1023
njzjz
added a commit
to njzjz/deepmd-kit
that referenced
this issue
Aug 25, 2021
This commit saves each checkpoint to `save_ckpt-step` (e.g. `model.ckpt-100`) instead of `save_ckpt` (e.g. `model.ckpt`) and keeps the 5 most recent checkpoints, which is the default of `tf.train.Saver`; the saver itself handles both. To avoid breaking existing behavior, a symlink named `model.ckpt` is then created that points to `model.ckpt-100`. (Normally the `checkpoint` file would track the latest checkpoint, but deepmd-kit doesn't read that file.) This can fix deepmodeling#1023 because (1) the symlink is created only after the checkpoint has already been saved, and (2) if something still goes wrong, one can restart from an earlier checkpoint.
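The linking step described in the commit message can be sketched roughly as follows. This is only an illustration, not the code from the commit: the helper name `link_latest_checkpoint` is hypothetical, and it assumes a checkpoint saved at a given step consists of several files sharing the prefix `save_ckpt-step` (for example `model.ckpt-100.index`), so each of them gets its own link under the plain `save_ckpt` prefix.

```python
import glob
import os


def link_latest_checkpoint(save_ckpt, step):
    """Point `save_ckpt.*` at the files of the checkpoint saved at `step`.

    Hypothetical helper for illustration; call it only after the
    checkpoint files for `step` have been written.
    """
    versioned = f"{save_ckpt}-{step}"
    links = []
    for src in sorted(glob.glob(glob.escape(versioned) + ".*")):
        suffix = src[len(versioned):]  # e.g. ".index" or ".meta"
        dst = save_ckpt + suffix
        # Drop a stale link (or file) left over from an earlier step.
        if os.path.lexists(dst):
            os.remove(dst)
        # Use a relative target so the checkpoint directory can be moved.
        os.symlink(os.path.basename(src), dst)
        links.append(dst)
    return links
```

Because the links are created only after the step-numbered files exist, an interrupted save leaves the previous links (and the older numbered checkpoints) untouched, which is the recovery path point (2) relies on.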
amcadmus
pushed a commit
that referenced
this issue
Aug 25, 2021
gzq942560379
pushed a commit
to HPC-AI-Team/deepmd-kit
that referenced
this issue
Sep 2, 2021
Summary
I encountered an error when restarting training with the following command:
dp train input.json --restart model.ckpt
My deepmd-kit version is 2.0.0.b4.