Reverse weight decay #567
base: main
Conversation
I think this PR needs to go into the train-olmo-large branch, no?
You are right. Then we need to make sure we compute this every time.
Akshita Bhagia commented on this pull request, in olmo/train.py:
+ if should_log_optim_metrics_this_step:
+     emb_decay_factor = 1.0 - optim_metrics["param/transformer.wte.weight.norm"]
+ else:
+     emb_decay_factor = 1.0
We compute the norm of the gradient every step (grad/transformer.wte.weight.norm), not the norm of the parameter itself (param/transformer.wte.weight.norm). Don't we need the latter?
Done
@epwalsh , can you look at this as well? This gets all up in your code.
olmo/optim.py
Outdated
if cfg.optimizer.decay_embeddings:
    decay.add(fpn)
elif cfg.optimizer.reverse_embedding_decay:
    embeddings_decay.add(fpn)
What happens if these are both set? We should check against that somewhere.
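For reference, a hedged sketch of what such a check could look like. The `decay_embeddings` / `reverse_embedding_decay` flags are from the diff above; the function itself is hypothetical, not the actual OLMo config validation hook:

```python
# Hypothetical validation sketch: reject configs that set both flags.
def validate_embedding_decay_options(optimizer_cfg) -> None:
    if getattr(optimizer_cfg, "decay_embeddings", False) and getattr(
        optimizer_cfg, "reverse_embedding_decay", False
    ):
        raise ValueError(
            "decay_embeddings and reverse_embedding_decay are mutually exclusive; "
            "set at most one of them."
        )
```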
CHANGELOG.md
Outdated
@@ -23,6 +23,7 @@ shared memory implementation can be used by passing `use_legacy_shared_mem_impl`
- Added MMLU multiple choice (A/B/C/D) 5-shot variant downstream tasks
- Tokenizer patch
- Added option to specify number of model replicas when using hybrid sharding.
- Added reverse_embedding_decay option.
This name also needs to be updated.
@@ -43,6 +43,7 @@ def clip_grads_and_collect_metrics(
    global_step: int,
    collect_param_metrics: bool = True,
    process_group: Optional[dist.ProcessGroup] = None,
    regularize_embeddings: bool = False,
Why is this a parameter to this function? Shouldn't it be just captured in the parameter groups? That's how all the other regularization works.
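For context, a rough sketch of the param-group approach being suggested. The group layout and the "wte"-based split are illustrative, not the actual OLMo grouping code; the point is only that the flag travels with the group rather than with the function signature:

```python
import torch

def build_param_groups(model: torch.nn.Module, weight_decay: float = 0.1):
    # Illustrative split: embedding weights vs. everything else.
    embedding_params, other_params = [], []
    for name, p in model.named_parameters():
        (embedding_params if "wte" in name else other_params).append(p)
    return [
        {"name": "decay_group", "params": other_params, "weight_decay": weight_decay},
        {
            "name": "embedding_decay_group",
            "params": embedding_params,
            "weight_decay": weight_decay,
            # The flag lives on the group itself, so anything iterating
            # optimizer.param_groups can read it without an extra argument.
            "regularize_embeddings": True,
        },
    ]
```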
if group["name"] == "embedding_decay_group":
    group["weight_decay"] *= emb_decay_factor
Doesn't this multiply up emb_decay_factor across batches? It feels like this should just be set, not multiplied. Or is there some other bit that resets group["weight_decay"] every time?
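One way to avoid the compounding, sketched under the assumption that the group's original weight decay can be stashed on the group once (the function name and the `base_weight_decay` key are hypothetical):

```python
def apply_embedding_decay_factor(optimizer, emb_decay_factor: float) -> None:
    # Scale from a stored base value instead of mutating the current value,
    # so the factor does not compound across steps.
    for group in optimizer.param_groups:
        if group.get("name") == "embedding_decay_group":
            base_wd = group.setdefault("base_weight_decay", group["weight_decay"])
            group["weight_decay"] = base_wd * emb_decay_factor
```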
emb_norm = optim_metrics["param/transformer.wte.weight.norm"]
emb_size = self.cfg.model.embedding_size or self.cfg.model.vocab_size
emb_std = math.sqrt(math.pow(emb_norm, 2) / float(emb_size * self.cfg.model.vocab_size))
emb_decay_factor = 1.0 - emb_std
If we're using this to plug into the value for WD, that means it needs to be negative when we want to pull up the values. So then it would be emb_std - 1?
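To spell out the sign argument (assuming AdamW-style decoupled weight decay, with learning rate $\eta$ and effective decay $\lambda$):

$$w \leftarrow w - \eta\,\lambda\,w$$

A positive $\lambda$ shrinks the weights; only a negative $\lambda$ pushes them up, so the multiplier would need to be (emb_std - 1), which is negative while the embeddings are smaller than unit scale.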
emb_norm = optim_metrics["param/transformer.wte.weight.norm"]
emb_size = self.cfg.model.embedding_size or self.cfg.model.vocab_size
emb_std = math.sqrt(math.pow(emb_norm, 2) / float(emb_size * self.cfg.model.vocab_size))
I believe the denominator should be float(self.cfg.model.d_model * emb_size). And I'm not sure about the numerator either... I don't see how this is equivalent to the standard deviation, since the summation terms in the norm are not centered by the mean, no?
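To make the second point concrete: with $N = d_\text{model} \cdot \text{emb\_size}$ entries $w_i$,

$$\frac{\lVert W \rVert_2^2}{N} = \frac{1}{N}\sum_i w_i^2 = \operatorname{Var}(w) + \bar{w}^2,$$

so the square root is the RMS of the entries, which only matches the standard deviation when the mean $\bar{w}$ is approximately zero.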
update: @AkshitaB and I discussed this; we think we need to calculate this metric separately in optim.py.
We also talked about how this standard deviation will be a little biased, since it will include parts of the embedding that are never used: we inflate the embedding size beyond the vocab size to be a multiple of 128. But this is probably okay since that's only a small part of the embeddings.
Actually, I think this is a big problem. Embeddings will want to be small, so this will push them up. Unused, or rarely used embeddings will never get updated, so they will get bigger and bigger, skewing the calculation of the stddev more and more.
Figuring out which embeddings to exclude from the stddev computation is going to be tricky in the distributed setting though.
Thinking out loud here... what if we force the unused params to be zero from the beginning? They would still bias the standard deviation by as much as they differ from the mean, but they would always be zero... I think.
That would work if we were starting with this from scratch, but what about the case when we want to use this to "rescue" a run? Can we explicitly make the unused embeddings zero when we load the model? And will it matter if we do so halfway through training?
Can we explicitly make the unused embeddings zero when we load the model?
I think that's our best bet. I can't think of any issues that would introduce in the middle of training. I suspect those parameters are 0 anyway due to weight decay and zero gradients.
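A rough sketch of what that zeroing could look like at load time. The attribute path `model.transformer.wte.weight` is assumed from the metric names above, and the gather/scatter care needed under FSDP is omitted:

```python
import torch

@torch.no_grad()
def zero_unused_embeddings(model, vocab_size: int) -> None:
    # Rows at index >= vocab_size exist only because the embedding matrix is
    # padded up to a multiple of 128; they never receive gradients, so zeroing
    # them keeps them from skewing the embedding-std estimate.
    emb_weight = model.transformer.wte.weight  # assumed attribute path
    if emb_weight.shape[0] > vocab_size:
        emb_weight[vocab_size:].zero_()
```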
Rare tokens would still be an issue, but not any more than they always are.
Goal: Perform reverse weight decay on embeddings
Multiply the weight_decay factor for the embeddings layer by (1 - norm(embeddings)).
TODO:
I tried this on a tiny test model config and got an overflow error. Possibly this will not be an issue with the actual model.
Note: I created the branch from train-olmo-large. See this for actual diffs for this PR.
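Putting the pieces of the discussion together, a minimal sketch of the mechanism described above, using plain PyTorch AdamW with a dedicated embedding param group. All function and group names are illustrative rather than the actual OLMo code, and the distributed/FSDP handling in the real trainer is omitted:

```python
import math

import torch
from torch.optim import AdamW

def embedding_decay_factor(emb_weight: torch.Tensor, vocab_size: int) -> float:
    # RMS of the used embedding rows, standing in for "norm(embeddings)" above.
    used = emb_weight[:vocab_size]
    emb_std = math.sqrt(used.pow(2).sum().item() / used.numel())
    # Per the description: multiply weight decay by (1 - std). Note the review
    # thread discusses flipping the sign to (std - 1) so small embeddings grow.
    return 1.0 - emb_std

def apply_reverse_embedding_decay(optimizer: AdamW, emb_weight: torch.Tensor, vocab_size: int) -> None:
    factor = embedding_decay_factor(emb_weight, vocab_size)
    for group in optimizer.param_groups:
        if group.get("name") == "embedding_decay_group":
            # Rescale from a fixed base value each step so the factor never compounds.
            base_wd = group.setdefault("base_weight_decay", group["weight_decay"])
            group["weight_decay"] = base_wd * factor

# Example wiring (hypothetical two-group setup; extra keys like "name" are
# preserved by torch optimizers):
# optimizer = AdamW(
#     [
#         {"name": "decay_group", "params": other_params, "weight_decay": 0.1},
#         {"name": "embedding_decay_group", "params": [embedding.weight], "weight_decay": 0.1},
#     ],
#     lr=3e-4,
# )
```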