32 b #121
base: main
Conversation
@@ -130,7 +130,7 @@ def build(self, trainer: "Trainer") -> Optional[Callback]:
        eval_batch_size = (
            self.eval_batch_size
            if self.eval_batch_size is not None
-           else trainer.rank_microbatch_size * get_world_size(trainer.dp_process_group)
+           else 2 * trainer.rank_microbatch_size * get_world_size(trainer.dp_process_group)
nit: you could instead pass an updated evaluator callback in OLMo2-32B.py:

.with_callback(
    "lm_evaluator",
    LMEvaluatorCallbackConfig(
        eval_batch_size=<whatever you want>,
        eval_dataset=NumpyDatasetConfig.from_data_mix(
            DataMix.v3_small_ppl_validation,
            name=NumpyDatasetType.padded_fsl,
            mix_base_dir=root_dir,
            sequence_length=dataset_config.effective_sequence_length,
            tokenizer=tokenizer_config,
            work_dir=get_work_dir(root_dir),
        ),
        eval_interval=1000,
    ),
)
Yeah, but I think this is better. I think we can default to 2x the training batch size. It should always work.
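Concretely, the new default is just twice the product of the per-rank micro-batch size and the data-parallel world size. A toy calculation, with made-up numbers in the same units that `rank_microbatch_size` uses:

```python
# Illustrative arithmetic only; the values here are hypothetical.
rank_microbatch_size = 4   # hypothetical per-rank micro-batch size
dp_world_size = 128        # hypothetical number of data-parallel ranks

old_default = rank_microbatch_size * dp_world_size      # 512 (previous default)
new_default = 2 * rank_microbatch_size * dp_world_size  # 1024, i.e. 2x, per the diff above
```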
        return (
            TrainerConfig(
-               save_folder=common.save_folder,
+               save_folder=f"gs://ai2-llm/checkpoints/{project_name}/",
Why change this?
It defaults to something under my name? Not what we want for an official run?
Especially if we swap babysitting responsibilities during the run
# import flash_attn.ops.triton.cross_entropy as flash_attn_ce  # type: ignore

_fused_cross_entropy_loss = triton_ce_loss.cross_entropy_loss
import flash_attn.ops.triton.cross_entropy as flash_attn_ce  # type: ignore
Our in-house triton CE loss was copied directly from the flash-attn repo, so I don't see the point of this.
Ok, I took this back out.
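For readers skimming the diff, the change that was tried and then removed amounts to an optional-import fallback roughly like the sketch below. This is not the repo's actual code; it assumes `flash_attn.ops.triton.cross_entropy` exposes the same `cross_entropy_loss` function that the in-house `triton_ce_loss` module was copied from.

```python
# Sketch only: prefer flash-attn's Triton cross-entropy kernel when available,
# otherwise fall back to the in-house copy referenced in the diff above.
try:
    import flash_attn.ops.triton.cross_entropy as flash_attn_ce  # type: ignore

    _fused_cross_entropy_loss = flash_attn_ce.cross_entropy_loss
except ImportError:
    # flash-attn not installed: use the in-house module (assumed in scope here).
    _fused_cross_entropy_loss = triton_ce_loss.cross_entropy_loss
```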
Do I want compiling and fused loss at the same time?
""" | ||
d_model = 5120 |
this is a very narrow model then... are you sure about that?
It's a clone of Qwen 32. The tradeoffs are: narrow d_model, wide FFN, GQA, lots of layers.
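For reference, a Qwen-32B-like shape looks roughly like the following. Apart from d_model (which matches the diff), the numbers are approximate values from Qwen's published 32B config, included only to illustrate the tradeoff; they are not taken from this PR.

```python
# Approximate Qwen-32B-style shape, for comparison only (not this PR's config).
qwen_32b_like = {
    "d_model": 5120,           # narrow hidden size for a ~32B-parameter model
    "n_layers": 64,            # ...but a deep stack of layers
    "n_heads": 40,
    "n_kv_heads": 8,           # grouped-query attention (GQA)
    "ffn_hidden_size": 27648,  # wide feed-forward blocks
}
```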
src/scripts/train/OLMo2-32B.py
fused_loss=True,
compile_loss=False,
I understand the trepidation about the different loss implementations, but the way it was before was the most performant. This way will be slower and have a higher memory footprint.
Can we have some certainty that this will do the right thing? What happens if we take the 13B from a late checkpoint and run it?
enabled=False,
cancel_check_interval=10,
),
).with_callback(
We should just add this to the common callbacks.
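Concretely, that would mean something like the sketch below: register the evaluator once in the shared callback set rather than per script. The helper name and the `common` fields used here are assumptions, and imports are elided, mirroring the suggestion earlier in this thread.

```python
# Sketch: hoist the evaluator into the shared callbacks so every training script
# picks it up by default. Helper name and `common` attributes are assumptions.
def common_callbacks(common: CommonComponents) -> dict:
    return {
        # ... existing shared callbacks ...
        "lm_evaluator": LMEvaluatorCallbackConfig(
            eval_dataset=NumpyDatasetConfig.from_data_mix(
                DataMix.v3_small_ppl_validation,
                name=NumpyDatasetType.padded_fsl,
                mix_base_dir=root_dir,
                sequence_length=common.dataset.effective_sequence_length,
                tokenizer=common.tokenizer,
                work_dir=get_work_dir(root_dir),
            ),
            eval_interval=1000,
        ),
    }
```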
"lm_evaluator": LMEvaluatorCallbackConfig( |
I don't know that we want these for everything. Default should probably be only the new, blessed ones.
@@ -85,10 +94,57 @@ def build_trainer_config(common: CommonComponents) -> TrainerConfig:
            WandBCallback(
                name=common.run_name,
                entity="ai2-llm",
-               project="OLMo-core-26B",
+               project=project_name,
                enabled=False,
Intentionally disabled still? Just checking