[BUG] Error when restarting model training #1023

Closed
Ericwang6 opened this issue Aug 24, 2021 · 0 comments
Summary

I encountered the following error when using the `dp train input.json --restart model.ckpt` command. My deepmd-kit version is 2.0.0.b4.

Traceback (most recent call last):
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable
         [[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 425, in main
    train_dp(**dict_args)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 89, in train
    _do_work(jdata, run_opt)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 147, in _do_work
    model.train(train_data, valid_data)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/train/trainer.py", line 421, in train
    self._init_session()
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/deepmd/train/trainer.py", line 399, in _init_session
    self.saver.restore (self.sess, self.run_opt.restart)
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 1303, in restore
    sess.run(self.saver_def.restore_op_name,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/public/wangyingze/local/deepmd-kit-2.0.0.b4/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable
         [[node save/RestoreV2 (defined at /lib/python3.9/site-packages/deepmd/train/trainer.py:383) ]]

Original stack trace for 'save/RestoreV2':
  File "/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 425, in main
    train_dp(**dict_args)
  File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 89, in train
    _do_work(jdata, run_opt)
  File "/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 147, in _do_work
    model.train(train_data, valid_data)
  File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 421, in train
    self._init_session()
  File "/lib/python3.9/site-packages/deepmd/train/trainer.py", line 383, in _init_session
    self.saver = tf.train.Saver()
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 876, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 515, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 335, in _AddRestoreOps
    all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
  File "/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/lib/python3.9/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1490, in restore_v2
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/lib/python3.9/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
    ret = Operation(
  File "/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)
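For context, this `DataLossError` ("file is too short to be an sstable") typically indicates that the checkpoint's data file on disk is truncated, for example because the job was killed while `model.ckpt` was being written. A minimal sketch of that failure mode, assuming TF1-style checkpoints as used by deepmd-kit 2.0.0.b4 (all file and variable names here are illustrative, not taken from the actual run):

```python
# Illustrative reproduction sketch; not deepmd-kit code.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Build a trivial graph and save a checkpoint.
v = tf.get_variable("v", shape=[10], initializer=tf.zeros_initializer())
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "./model.ckpt")

# Simulate a crash in the middle of writing: truncate the data file.
with open("./model.ckpt.data-00000-of-00001", "r+b") as f:
    f.truncate(8)  # leave only a few bytes

# Restoring now fails with a DataLossError like the one above.
with tf.Session() as sess:
    saver.restore(sess, "./model.ckpt")
```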
Ericwang6 added the bug label Aug 24, 2021
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Aug 25, 2021
This commit saves checkpoints to `save_ckpt-step` (e.g. `model.ckpt-100`) instead of `save_ckpt` (e.g. `model.ckpt`), and keeps the 5 most recent checkpoint files (the default of `tf.Saver`). Both behaviors are handled by `tf.Saver`. To avoid breaking existing behavior, a symlink is then made from `model.ckpt-100` to `model.ckpt`. (Normally the latest checkpoint would be resolved through the `checkpoint` file, but deepmd-kit does not read that file.)
This fixes deepmodeling#1023, because (1) the symlink is created only after a checkpoint has been completely saved, and (2) if something still goes wrong, a previous checkpoint can be used instead.
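A rough sketch of the checkpointing scheme described above, assuming TF1-style `tf.train.Saver` (the names and the symlink layout here are illustrative; the actual implementation lives in `deepmd/train/trainer.py`):

```python
# Illustrative sketch of "save numbered checkpoint, then symlink"; not the
# actual deepmd-kit implementation.
import os
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

save_ckpt = "model.ckpt"
v = tf.get_variable("v", shape=[10], initializer=tf.zeros_initializer())
saver = tf.train.Saver(max_to_keep=5)  # 5 is tf.train.Saver's default

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    step = 100  # illustrative training step
    # Writes model.ckpt-100.index, model.ckpt-100.data-00000-of-00001, etc.
    ckpt_prefix = saver.save(sess, save_ckpt, global_step=step)
    # Only after the save has completed, repoint the stable name at the
    # new files, so an interrupted save can no longer corrupt model.ckpt.
    for suffix in (".index", ".meta", ".data-00000-of-00001"):
        link = save_ckpt + suffix
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(os.path.basename(ckpt_prefix) + suffix, link)
```

With this scheme the stable `model.ckpt` name is only ever repointed after a complete checkpoint exists on disk, and `max_to_keep=5` leaves earlier checkpoints to fall back on.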
amcadmus pushed a commit that referenced this issue Aug 25, 2021
njzjz closed this as completed Aug 25, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this issue Sep 2, 2021