Why stair-like loss curve? #101
As in here, as well as in my own implementation, stair-like loss curves are observed. Any possible reason for this?

Comments
I can't speak to the training runs behind that graph as I didn't do them; @mitchellnw would have a better idea... but it looks like it could be a shuffling issue (as in, the data not being properly shuffled).
My guess is also a shuffling issue with webdataset when these runs were done.
If the data is not pre-shuffled, you need both shard shuffling and local (sample-level) shuffling.
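For illustration, a minimal sketch of shard-level plus local buffer shuffling with webdataset; the shard pattern and buffer size here are made up and this is not the actual open_clip data pipeline:

```python
import webdataset as wds

# Hypothetical shard pattern; replace with the real shard URLs.
shards = "dataset-train-{000000..000999}.tar"

dataset = (
    wds.WebDataset(shards, shardshuffle=True)  # reshuffle the order of shards each epoch
    .shuffle(5000)                             # in-memory buffer that shuffles samples across nearby shards
    .decode("pil")
    .to_tuple("jpg", "txt")
)
```

Without both steps, samples from the same (unshuffled) shard arrive in long correlated runs, which can produce epoch-aligned artifacts in the loss.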
Perhaps it is not a shuffling issue with webdataset: I trained the model on CC3M (csv dataset) and observed curves that look very similar to the one in this issue. The loss increases within each epoch, then decreases after each epoch...
@ChenDelong1999 did you pre-shuffle the dataset (randomly sort the rows)?
CsvDataset should be shuffled every epoch, so pre-shuffling isn't really relevant. Might be worth checking that the per-epoch shuffle is actually happening: open_clip/src/training/train.py, line 62 at d9ee4aa.
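As a reference point, one common way a per-epoch shuffle silently fails in PyTorch is forgetting to call set_epoch() on a DistributedSampler. The sketch below shows the standard pattern; it is not the exact code at the linked line:

```python
from torch.utils.data import DataLoader, DistributedSampler

def train(dataset, model, optimizer, num_epochs, distributed=False):
    # A DistributedSampler only draws a new shuffle order when set_epoch()
    # is called; forgetting it means every epoch iterates the same order.
    sampler = DistributedSampler(dataset, shuffle=True) if distributed else None
    loader = DataLoader(dataset, batch_size=256,
                        shuffle=(sampler is None), sampler=sampler)
    for epoch in range(num_epochs):
        if sampler is not None:
            sampler.set_epoch(epoch)
        for images, texts in loader:
            ...  # forward pass, loss, backward, optimizer.step()
```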
Looking at this again, I wonder if it is caused by the learnable logit_scale. I would expect that to produce something stair-like, but I have no guesses for why. To test this hypothesis I would use a 10x smaller learning rate on the logit_scale.
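To make that test concrete, here is a rough sketch of giving the logit scale its own optimizer group with a 10x smaller learning rate; it assumes an open_clip-style model whose learnable temperature parameter is named logit_scale, and the base LR value is illustrative:

```python
import torch

base_lr = 5e-4  # illustrative value

# `model` is assumed to be an already-constructed CLIP model.
logit_scale_params = [p for n, p in model.named_parameters() if "logit_scale" in n]
other_params = [p for n, p in model.named_parameters() if "logit_scale" not in n]

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr},
    {"params": logit_scale_params, "lr": base_lr / 10},  # 10x smaller LR on logit_scale
])
```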
@mitchellnw I've noticed that the scale param has an interesting relationship with the LR/loss; I wonder if it's almost behaving in a slightly oscillatory, control-systems fashion. The scale is strongly impacted by the LR as well: if the LR is high enough, the scale will not converge to 100 until the LR lowers.
Interesting. I wonder how accuracy/loss would be impacted if this learnable param were replaced by a scheduled param, something like 100 - k*cosine_decay(iteration).
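A small sketch of what such a scheduled scale could look like; the function name and the value of k are made up, only the form 100 - k*cosine_decay(iteration) comes from the comment above:

```python
import math

def scheduled_logit_scale(iteration, total_iterations, k=90.0):
    # cosine_decay goes from 1 at iteration 0 down to 0 at total_iterations,
    # so the scale ramps from 100 - k up to 100 over the course of training.
    cosine_decay = 0.5 * (1.0 + math.cos(math.pi * iteration / total_iterations))
    return 100.0 - k * cosine_decay
```

This would remove the learnable scale from the optimization entirely, so any stair-like pattern driven by its learning dynamics should disappear.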
This issue was moved to a discussion.