Not benefiting from checkpointing #297

mstfldmr · 2022-09-27T14:35:02Z

Hello,

I save checkpoints with:

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
  filepath = checkpoint_path + '/epoch-{epoch:02d}/',
  monitor = 'val_loss',
  save_freq = 'epoch',
  save_weights_only = False,
  save_best_only = False,
  mode = 'auto')

After loading the latest checkpoint and continuing training, I would expect the loss value to be around the loss value in the last checkpoint.

    model = tf.keras.models.load_model(
        model_path,
        custom_objects={"SimilarityModel": tfsim.models.SimilarityModel,
                        'MyOptimizer': tfa.optimizers.RectifiedAdam})

    model.load_index(model_path)

    model.fit(
        datasampler,
        callbacks = callbacks,
        epochs = args.epochs,
        initial_epoch=initial_epoch_number,
        steps_per_epoch = N_TRAIN_SAMPLES ,
        verbose=2
    )

However, the loss value does not continue from where it left. It looks like it's simply starting the training from scratch and not benefiting from checkpoints.

The text was updated successfully, but these errors were encountered:

owenvallis · 2022-10-24T04:29:33Z

Thanks for submitting the issue @mstfldmr. Do you have a simple example I can use to try and repro the issue? I can also try and repro this using our basic example, but it might be good to get closer to your current set up as well.

mstfldmr · 2022-11-05T11:43:42Z

@owenvallis I'm sorry, I can't share the full code because it has some confidential pieces we developed. This was how I configured checkpointing:

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='loss', save_freq='epoch', save_weights_only=False, save_best_only=False)

and how I loaded a checkpoint back:

    resumed_model_from_checkpoints = tf.keras.models.load_model(f'{checkpoint_path}/{max_epoch_filename}')

mstfldmr · 2022-11-21T12:24:06Z

@owenvallis could you reproduce it?

owenvallis · 2022-11-23T17:56:49Z

Hi @mstfldmr, sorry for the delay here. I'll try and get to this this week.

owenvallis · 2022-12-02T18:56:49Z

Looking into this now as it also looks like there is a breaking change in 2.8 where they removed Optimizer.get_weights() (see keras-team/tf-keras#442). That issue also mentions that SavedModel didn't properly save the weights for certain optimizers in the past (see tensorflow/tensorflow#44670).

Which optimizer were you using? Was it Adam?

mstfldmr · 2022-12-05T19:13:45Z

@owenvallis yes, it was tfa.optimizers.RectifiedAdam.

mstfldmr changed the title ~~Loss is too high after loading checkpoints and continuing training~~ Not benefiting from checkpointing Sep 27, 2022

owenvallis self-assigned this Oct 24, 2022

owenvallis added type:bug Something isn't working component:model component:losses Issues related to support additional metric learning technqiues (e.g loss) labels Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not benefiting from checkpointing #297

Not benefiting from checkpointing #297

mstfldmr commented Sep 27, 2022 •

edited

Loading

owenvallis commented Oct 24, 2022

mstfldmr commented Nov 5, 2022

mstfldmr commented Nov 21, 2022

owenvallis commented Nov 23, 2022

owenvallis commented Dec 2, 2022 •

edited

Loading

mstfldmr commented Dec 5, 2022 •

edited

Loading

Not benefiting from checkpointing #297

Not benefiting from checkpointing #297

Comments

mstfldmr commented Sep 27, 2022 • edited Loading

owenvallis commented Oct 24, 2022

mstfldmr commented Nov 5, 2022

mstfldmr commented Nov 21, 2022

owenvallis commented Nov 23, 2022

owenvallis commented Dec 2, 2022 • edited Loading

mstfldmr commented Dec 5, 2022 • edited Loading

mstfldmr commented Sep 27, 2022 •

edited

Loading

owenvallis commented Dec 2, 2022 •

edited

Loading

mstfldmr commented Dec 5, 2022 •

edited

Loading