[BUG] Error when restarting model training #1023

Closed
Ericwang6 opened this issue Aug 24, 2021 · 0 comments
Summary

I encountered the following error when using the `dp train input.json --restart model.ckpt` command. My deepmd-kit version is 2.0.0.b4.

Traceback (most recent call last):
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable
         [[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 425, in main
    train_dp(**dict_args)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 89, in train
    _do_work(jdata, run_opt)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 147, in _do_work
    model.train(train_data, valid_data)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/train/trainer.py", line 421, in train
    self._init_session()
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/train/trainer.py", line 399, in _init_session
    self.saver.restore (self.sess, self.run_opt.restart)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1303, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable
         [[node save/RestoreV2 (defined at /lib/python3.9/site-packages/deepmd/train/trainer.py:383) ]]

Original stack trace for 'save/RestoreV2':
  File "/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 425, in main
    train_dp(**dict_args)
  File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 89, in train
    _do_work(jdata, run_opt)
  File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 147, in _do_work
    model.train(train_data, valid_data)
  File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 421, in train
    self._init_session()
  File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 383, in _init_session
    self.saver = tf.train.Saver()
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 515, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1490, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
    ret = Operation(
  File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)
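For context, this `DataLossError` ("file is too short to be an sstable") typically indicates that the checkpoint's data file on disk is truncated, for example because the job was killed while `model.ckpt` was being written. A minimal sketch of that failure mode, assuming TF1-style checkpoints as used by deepmd-kit 2.0.0.b4 (all file and variable names here are illustrative, not taken from the actual run):

```python
# Illustrative reproduction sketch; not deepmd-kit code.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Build a trivial graph and save a checkpoint.
v = tf.get_variable("v", shape=[10], initializer=tf.zeros_initializer())
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "./model.ckpt")

# Simulate a crash in the middle of writing: truncate the data file.
with open("./model.ckpt.data-00000-of-00001", "r+b") as f:
    f.truncate(8)  # leave only a few bytes

# Restoring now fails with a DataLossError like the one above.
with tf.Session() as sess:
    saver.restore(sess, "./model.ckpt")
```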
Ericwang6 added the bug label Aug 24, 2021
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Aug 25, 2021
This commit saves checkpoints to `save_ckpt-step` (e.g. `model.ckpt-100`) instead of `save_ckpt` (e.g. `model.ckpt`), and keeps the 5 most recent checkpoint files (the default of `tf.Saver`). Both behaviors are handled by `tf.Saver`. To avoid breaking existing behavior, a symlink is then made from `model.ckpt-100` to `model.ckpt`. (Normally the latest checkpoint would be resolved through the `checkpoint` file, but deepmd-kit does not read that file.)
This fixes deepmodeling#1023, because (1) the symlink is created only after a checkpoint has been completely saved, and (2) if something still goes wrong, a previous checkpoint can be used instead.
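A rough sketch of the checkpointing scheme described above, assuming TF1-style `tf.train.Saver` (the names and the symlink layout here are illustrative; the actual implementation lives in `deepmd/train/trainer.py`):

```python
# Illustrative sketch of "save numbered checkpoint, then symlink"; not the
# actual deepmd-kit implementation.
import os
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

save_ckpt = "model.ckpt"
v = tf.get_variable("v", shape=[10], initializer=tf.zeros_initializer())
saver = tf.train.Saver(max_to_keep=5)  # 5 is tf.train.Saver's default

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    step = 100  # illustrative training step
    # Writes model.ckpt-100.index, model.ckpt-100.data-00000-of-00001, etc.
    ckpt_prefix = saver.save(sess, save_ckpt, global_step=step)
    # Only after the save has completed, repoint the stable name at the
    # new files, so an interrupted save can no longer corrupt model.ckpt.
    for suffix in (".index", ".meta", ".data-00000-of-00001"):
        link = save_ckpt + suffix
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(os.path.basename(ckpt_prefix) + suffix, link)
```

With this scheme the stable `model.ckpt` name is only ever repointed after a complete checkpoint exists on disk, and `max_to_keep=5` leaves earlier checkpoints to fall back on.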
amcadmus pushed a commit that referenced this issue Aug 25, 2021
njzjz closed this as completed Aug 25, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this issue Sep 2, 2021