Why stair-like loss curve? #101
As in here, as well as in my own implementation, stair-like loss curves are observed. Any possible reason for this?

Comments
I can't speak to the training runs behind that graph as I didn't do them; @mitchellnw would have a better idea... but it looks like it could be a shuffling issue (as in, the data not being properly shuffled).
My guess is also a shuffling issue with webdataset when these runs were done.
If the data is not pre-shuffled, you need both shard shuffling and local (sample-level) shuffling.
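For illustration, a minimal sketch of shard-level plus local buffer shuffling with webdataset; the shard pattern and buffer size here are made up and this is not the actual open_clip data pipeline:

```python
import webdataset as wds

# Hypothetical shard pattern; replace with the real shard URLs.
shards = "dataset-train-{000000..000999}.tar"

dataset = (
    wds.WebDataset(shards, shardshuffle=True)  # reshuffle the order of shards each epoch
    .shuffle(5000)                             # in-memory buffer that shuffles samples across nearby shards
    .decode("pil")
    .to_tuple("jpg", "txt")
)
```

Without both steps, samples from the same (unshuffled) shard arrive in long correlated runs, which can produce epoch-aligned artifacts in the loss.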
Perhaps it is not a shuffling issue with webdataset: I trained the model on CC3M (csv dataset) and observed curves that look very similar to the one in this issue. The loss increases within each epoch, then decreases after each epoch...
@ChenDelong1999 did you pre-shuffle the dataset (randomly sort the rows)?
CsvDataset should be shuffled every epoch, so pre-shuffling isn't really relevant. Might be worth checking that the per-epoch shuffle is actually happening: open_clip/src/training/train.py, line 62 at d9ee4aa.
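As a reference point, one common way a per-epoch shuffle silently fails in PyTorch is forgetting to call set_epoch() on a DistributedSampler. The sketch below shows the standard pattern; it is not the exact code at the linked line:

```python
from torch.utils.data import DataLoader, DistributedSampler

def train(dataset, model, optimizer, num_epochs, distributed=False):
    # A DistributedSampler only draws a new shuffle order when set_epoch()
    # is called; forgetting it means every epoch iterates the same order.
    sampler = DistributedSampler(dataset, shuffle=True) if distributed else None
    loader = DataLoader(dataset, batch_size=256,
                        shuffle=(sampler is None), sampler=sampler)
    for epoch in range(num_epochs):
        if sampler is not None:
            sampler.set_epoch(epoch)
        for images, texts in loader:
            ...  # forward pass, loss, backward, optimizer.step()
```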
Looking at this again, I wonder if it is caused by the learnable logit_scale. I would expect that to produce something stair-like, but I have no guesses for why. To test this hypothesis I would use a 10x smaller learning rate on the logit_scale.
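To make that test concrete, here is a rough sketch of giving the logit scale its own optimizer group with a 10x smaller learning rate; it assumes an open_clip-style model whose learnable temperature parameter is named logit_scale, and the base LR value is illustrative:

```python
import torch

base_lr = 5e-4  # illustrative value

# `model` is assumed to be an already-constructed CLIP model.
logit_scale_params = [p for n, p in model.named_parameters() if "logit_scale" in n]
other_params = [p for n, p in model.named_parameters() if "logit_scale" not in n]

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr},
    {"params": logit_scale_params, "lr": base_lr / 10},  # 10x smaller LR on logit_scale
])
```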
@mitchellnw I've noticed that the scale param has an interesting relationship with the LR/loss; I wonder if it's almost behaving in a slightly oscillatory, control-systems fashion. The scale is strongly impacted by the LR as well: if the LR is high enough, the scale will not converge to 100 until the LR lowers.
Interesting. I wonder how accuracy/loss would be impacted if this learnable param were replaced by a scheduled param, something like 100 - k*cosine_decay(iteration).
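A small sketch of what such a scheduled scale could look like; the function name and the value of k are made up, only the form 100 - k*cosine_decay(iteration) comes from the comment above:

```python
import math

def scheduled_logit_scale(iteration, total_iterations, k=90.0):
    # cosine_decay goes from 1 at iteration 0 down to 0 at total_iterations,
    # so the scale ramps from 100 - k up to 100 over the course of training.
    cosine_decay = 0.5 * (1.0 + math.cos(math.pi * iteration / total_iterations))
    return 100.0 - k * cosine_decay
```

This would remove the learnable scale from the optimization entirely, so any stair-like pattern driven by its learning dynamics should disappear.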
This issue was moved to a discussion.