save checkpoint files with step and keep recent files #1031
This commit saves checkpoints to `save_ckpt-step` (e.g. `model.ckpt-100`) instead of `save_ckpt` (e.g. `model.ckpt`), and keeps the 5 most recent checkpoint files (the default value of `tf.Saver`). Both behaviors are handled by `tf.Saver`. To avoid breaking existing behavior, a symlink is then created so that `model.ckpt` points to `model.ckpt-100`. (Usually this would be controlled by the `checkpoint` file, but deepmd-kit doesn't read this file.)

This can fix #1023, because (1) the symlink is made only after a checkpoint has already been saved; (2) if something is still wrong, one can use a previous checkpoint instead.
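
The save-then-symlink pattern described here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in this PR: `save_with_step` is a hypothetical helper, each checkpoint is treated as a single file (real TensorFlow checkpoints span several files), the rotation that `tf.Saver` performs is imitated by hand, and POSIX symlink semantics are assumed.

```python
import os
import re


def save_with_step(prefix, step, payload, max_to_keep=5):
    """Write `payload` to `<prefix>-<step>`, then point `<prefix>` at it.

    Hypothetical stand-in for what tf.Saver plus the symlink step do together.
    """
    target = f"{prefix}-{step}"
    with open(target, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())

    # Create the link only after the checkpoint is completely written.
    # Build it under a temporary name and rename it over the old link,
    # so `<prefix>` never refers to a half-written file.
    tmp_link = f"{prefix}.tmp-link"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.basename(target), tmp_link)
    os.replace(tmp_link, prefix)

    # Keep only the `max_to_keep` most recent step-suffixed files.
    directory = os.path.dirname(prefix) or "."
    pattern = re.compile(re.escape(os.path.basename(prefix)) + r"-(\d+)$")
    steps = sorted(
        int(m.group(1))
        for name in os.listdir(directory)
        if (m := pattern.match(name))
    )
    excess = max(len(steps) - max_to_keep, 0)
    for old in steps[:excess]:
        os.remove(f"{prefix}-{old}")
    return target
```

Because the link is switched only after the new file is on disk, a reader that opens `model.ckpt` sees either the previous checkpoint or the new one, and the older step-suffixed files remain available as fallbacks.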