Normal baselines #618

Open: wants to merge 50 commits into base main from normal-baselines

Commits (50):

c5913aa  7b normal baseline scripts (AkshitaB, Jun 12, 2024)
e2cd59b  add new evals (AkshitaB, Jun 12, 2024)
3d02325  add 1b config (AkshitaB, Jun 12, 2024)
995247f  1b scripts (AkshitaB, Jun 12, 2024)
b71dff9  turn off fused_loss (AkshitaB, Jun 12, 2024)
0de7234  fix name (AkshitaB, Jun 12, 2024)
75ae73f  make executable (AkshitaB, Jun 12, 2024)
ed51f61  temporarily don't run new evals (AkshitaB, Jun 12, 2024)
3293cbb  switch to pete's torch2.3 image (AkshitaB, Jun 12, 2024)
d77add5  no clipping warmup (AkshitaB, Jun 12, 2024)
eff21ee  wait longer (AkshitaB, Jun 12, 2024)
c1075ce  priority (AkshitaB, Jun 12, 2024)
1c58105  Load from checkpoint, and also more data loading workers (dirkgr, Jun 12, 2024)
cd04bad  Fewer nodes :-( (dirkgr, Jun 12, 2024)
63d2a26  start from scratch again (AkshitaB, Jun 12, 2024)
aa8ca13  restart 1b (AkshitaB, Jun 12, 2024)
4fb871a  add weka configs (AkshitaB, Jun 12, 2024)
f89437d  1b with weka and shard_grad_op (AkshitaB, Jun 12, 2024)
8083742  oops, actually use weka with 1b (AkshitaB, Jun 12, 2024)
1e7a243  7b with weka and more nodes (AkshitaB, Jun 12, 2024)
57cc378  weka credentials (AkshitaB, Jun 12, 2024)
fe716ec  use my creds for s3 (AkshitaB, Jun 12, 2024)
910d25b  weka creds for 7b (AkshitaB, Jun 12, 2024)
8d273e2  reduce num workers (AkshitaB, Jun 12, 2024)
ad14ba8  fewer nodes again (AkshitaB, Jun 12, 2024)
c784548  go back to s3 with more nodes (AkshitaB, Jun 12, 2024)
fa8ffa0  more workers (AkshitaB, Jun 12, 2024)
1bbc892  back to 256 (AkshitaB, Jun 12, 2024)
a7cbcd6  more nodes (AkshitaB, Jun 13, 2024)
db1151e  nope; 256 it is (AkshitaB, Jun 13, 2024)
700cea9  fix the lr schedule :( (AkshitaB, Jun 13, 2024)
faf5fb5  fix (AkshitaB, Jun 13, 2024)
d4e3716  baseline explicit configs (AkshitaB, Jun 13, 2024)
2c3e1e6  use newer configs (AkshitaB, Jun 13, 2024)
fd3ff93  add back the new evals (AkshitaB, Jun 13, 2024)
96c338c  weka fs (AkshitaB, Jun 14, 2024)
35b51ce  7b with wekafs (AkshitaB, Jun 14, 2024)
02b42f7  Make olmo-core checkpointer more robust on weka (epwalsh, Jun 14, 2024)
dbe1e15  spike debugging scripts (AkshitaB, Jun 18, 2024)
7b47d8b  remove line (AkshitaB, Jun 18, 2024)
6853be8  rerun but skip the problematic batch (AkshitaB, Jun 18, 2024)
9ab633b  more nodes (AkshitaB, Jun 18, 2024)
56e6c32  add nccl ib env var (AkshitaB, Jun 18, 2024)
8c7a35a  reduce num workers? (AkshitaB, Jun 18, 2024)
28bb493  256 nodes (AkshitaB, Jun 19, 2024)
7ed76d4  128 nodes (AkshitaB, Jun 21, 2024)
283a775  back to 256 (AkshitaB, Jun 21, 2024)
ca7f696  Fewer nodes (dirkgr, Jun 22, 2024)
914a2b5  NCCL env var to try (dirkgr, Jun 22, 2024)
51ae8fa  Merge branch 'main' into normal-baselines (AkshitaB, Jul 17, 2024)

Files changed:

1,302 changes: 1,302 additions & 0 deletions configs/llamaish1-normal-s3.yaml (large diff not rendered by default)

1,302 changes: 1,302 additions & 0 deletions configs/llamaish1-normal-weka.yaml (large diff not rendered by default)

1,297 changes: 1,297 additions & 0 deletions configs/llamaish1-weka.yaml (large diff not rendered by default)

1,300 changes: 1,300 additions & 0 deletions configs/llamaish7-normal-s3.yaml (large diff not rendered by default)

1,300 changes: 1,300 additions & 0 deletions configs/llamaish7-normal-weka.yaml (large diff not rendered by default)

1,296 changes: 1,296 additions & 0 deletions configs/llamaish7-weka.yaml (large diff not rendered by default)

41 changes: 41 additions & 0 deletions scripts/beaker/debug/llamaish7-normal-launch.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=32

gantry run \
--workspace ai2/OLMo-training \
--task-name llamaish7-normal-spike-debug \
--description "OLMo medium - 7B - Llamaish Normal Spike Debug" \
--priority urgent \
--preemptible \
--beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--weka oe-training-default:/weka/oe-training-default \
--propagate-failure \
--synchronized-start-timeout 60m \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env-secret WANDB_API_KEY=AKSHITAB_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=AKSHITAB_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=AKSHITAB_AWS_SECRET_ACCESS_KEY \
--env R2_PROFILE=R2 \
--env S3_PROFILE=S3 \
--env WEKA_PROFILE=WEKA \
--env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
--env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
--env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
--env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
--shared-memory 10GiB \
--venv base \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/beaker/debug/llamaish7-normal.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
54 changes: 54 additions & 0 deletions scripts/beaker/debug/llamaish7-normal.sh
@@ -0,0 +1,54 @@
#!/usr/bin/env bash
set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

NUM_NODES=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Warm HF cache
mkdir -p /root/.cache
pushd /root/.cache
curl "https://storage.googleapis.com/hf-cache/huggingface_cache_v4.tar.gz" | tar --keep-newer-files -xzf -
popd
export HF_DATASETS_OFFLINE=1

# Move AWS credentials from env to relevant files
mkdir -p ~/.aws
printenv AWS_CONFIG > ~/.aws/config
printenv AWS_CREDENTIALS > ~/.aws/credentials

export EXPERIMENT=llamaish7-normal-final-spike-rerun-2

torchrun \
--nnodes ${NUM_NODES}:${NUM_NODES} \
--nproc-per-node 8 \
--rdzv_id=12347 \
--rdzv_backend=static \
--rdzv_endpoint=$BEAKER_LEADER_REPLICA_HOSTNAME:29400 \
--node_rank=$BEAKER_REPLICA_RANK \
--rdzv_conf="read_timeout=420" \
scripts/train.py \
configs/llamaish7-normal-weka.yaml \
--run_name=$EXPERIMENT \
--wandb.name=$EXPERIMENT \
--wandb.group=$EXPERIMENT \
--fsdp.wrapping_strategy=by_block_and_size \
--fsdp.sharding_strategy=SHARD_GRAD_OP \
--save_folder=runs/ \
--activation_checkpointing=fine_grained \
--device_train_microbatch_size=2 \
--global_train_batch_size=1024 \
--save_interval=250 \
--eval_interval=250 \
--optimizer.metrics_log_interval=1 \
--save_overwrite \
--save_num_checkpoints_to_keep=3 \
--data.num_workers=64 \
--fast_forward_batches=1 \
--load_path=s3://ai2-llm/checkpoints/OLMo-medium/llamaish7-normal-final/step96750
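
A note on how the launch script and this inner script fit together: gantry starts one replica per node, and each replica invokes the inner script with the leader hostname, the node count, and its own rank, which torchrun then uses for its static rendezvous. A minimal sketch of what a single replica ends up executing; the hostname and rank values here are illustrative, not taken from the PR:

bash scripts/beaker/debug/llamaish7-normal.sh leader-node.example.com 32 3
# $1 -> BEAKER_LEADER_REPLICA_HOSTNAME, used as the --rdzv_endpoint host (port 29400)
# $2 -> NUM_NODES, used for --nnodes ${NUM_NODES}:${NUM_NODES}
# $3 -> BEAKER_REPLICA_RANK, used for --node_rank
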
41 changes: 41 additions & 0 deletions scripts/beaker/llamaish1-normal-launch.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=8

gantry run \
--workspace ai2/OLMo-training \
--task-name llamaish1-normal \
--description "OLMo small - 1B - Llamaish Normal Weka" \
--priority urgent \
--preemptible \
--beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--weka oe-training-default:/weka/oe-training-default \
--propagate-failure \
--synchronized-start-timeout 20m \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env-secret WANDB_API_KEY=AKSHITAB_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=AKSHITAB_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=AKSHITAB_AWS_SECRET_ACCESS_KEY \
--env R2_PROFILE=R2 \
--env S3_PROFILE=S3 \
--env WEKA_PROFILE=WEKA \
--env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
--env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
--env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
--env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
--shared-memory 10GiB \
--venv base \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/beaker/llamaish1-normal.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
53 changes: 53 additions & 0 deletions scripts/beaker/llamaish1-normal.sh
@@ -0,0 +1,53 @@
#!/usr/bin/env bash
set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

NUM_NODES=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Warm HF cache
mkdir -p /root/.cache
pushd /root/.cache
curl "https://storage.googleapis.com/hf-cache/huggingface_cache_v4.tar.gz" | tar --keep-newer-files -xzf -
popd
export HF_DATASETS_OFFLINE=1

# Move AWS credentials from env to relevant files
mkdir -p ~/.aws
printenv AWS_CONFIG > ~/.aws/config
printenv AWS_CREDENTIALS > ~/.aws/credentials


export EXPERIMENT=llamaish1-normal-final

torchrun \
--nnodes ${NUM_NODES}:${NUM_NODES} \
--nproc-per-node 8 \
--rdzv_id=12347 \
--rdzv_backend=static \
--rdzv_endpoint=$BEAKER_LEADER_REPLICA_HOSTNAME:29400 \
--node_rank=$BEAKER_REPLICA_RANK \
--rdzv_conf="read_timeout=420" \
scripts/train.py \
configs/llamaish1-normal-weka.yaml \
--run_name=$EXPERIMENT \
--wandb.name=$EXPERIMENT \
--wandb.group=$EXPERIMENT \
--fsdp.wrapping_strategy=by_block_and_size \
--fsdp.sharding_strategy=SHARD_GRAD_OP \
--save_folder=runs/ \

Review comment (Member), on "--save_folder=runs/ \":
nit: I don't think there's any need to override this
Suggested change: remove this line.

--device_train_microbatch_size=4 \
--global_train_batch_size=512 \

Review comment (Member), on "--global_train_batch_size=512 \":
this is the default in the config, right?
Suggested change: remove this line.

--save_interval=250 \
--eval_interval=250 \
--optimizer.metrics_log_interval=1 \
--save_overwrite \
--save_num_checkpoints_to_keep=3 \
'--load_path=${path.last_checkpoint:s3://ai2-llm/checkpoints/OLMo-small/llamaish1-normal-final/}'
#--load_path=s3://ai2-llm/checkpoints/OLMo-small/llamaish1-normal-shard/step2000
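
A note on the quoting of the load_path argument above: the single quotes keep the ${...} expression literal, so the shell does not expand it and the trainer's own config interpolation (presumably the path.last_checkpoint resolver) picks the most recent checkpoint under that prefix. A hedged illustration:

# Quoted as in the script: passed through literally for the trainer to resolve.
'--load_path=${path.last_checkpoint:s3://ai2-llm/checkpoints/OLMo-small/llamaish1-normal-final/}'
# Without the single quotes, bash would reject ${path.last_checkpoint:...} as a bad substitution.
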
14 changes: 11 additions & 3 deletions scripts/beaker/llamaish7-normal-launch.sh
@@ -2,22 +2,23 @@

set -ex

- NUM_NODES=64
+ NUM_NODES=16

gantry run \
--workspace ai2/OLMo-training \
- --task-name llamaish7-normal-qk-norm-reorder-zloss \
+ --task-name llamaish7-normal \
--description "OLMo medium - 7B - Llamaish Normal" \
--priority urgent \
--preemptible \
- --beaker-image shanea/olmo-torch2.3-gantry \
+ --beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
+ --weka oe-training-default:/weka/oe-training-default \
--propagate-failure \
--synchronized-start-timeout 15m \
--env LOG_FILTER_TYPE=local_rank0_only \
@@ -26,6 +27,13 @@ gantry run \
--env-secret WANDB_API_KEY=AKSHITAB_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=AKSHITAB_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=AKSHITAB_AWS_SECRET_ACCESS_KEY \
+ --env R2_PROFILE=R2 \
+ --env S3_PROFILE=S3 \
+ --env WEKA_PROFILE=WEKA \
+ --env-secret AWS_CONFIG=PETEW_AWS_CONFIG \
+ --env-secret AWS_CREDENTIALS=PETEW_AWS_CREDENTIALS \
+ --env-secret R2_ENDPOINT_URL=R2_ENDPOINT_URL \
+ --env-secret WEKA_ENDPOINT_URL=WEKA_ENDPOINT_URL \
--shared-memory 10GiB \
--venv base \
--yes \
26 changes: 15 additions & 11 deletions scripts/beaker/llamaish7-normal.sh
@@ -14,11 +14,20 @@ shift
# Warm HF cache
mkdir -p /root/.cache
pushd /root/.cache
curl "https://storage.googleapis.com/dirkgr-public/huggingface_cache_v3.tar.gz" | tar --keep-newer-files -xzf -
curl "https://storage.googleapis.com/hf-cache/huggingface_cache_v4.tar.gz" | tar --keep-newer-files -xzf -
popd
export HF_DATASETS_OFFLINE=1

export EXPERIMENT=llamaish7-normal
# Move AWS credentials from env to relevant files
mkdir -p ~/.aws
printenv AWS_CONFIG > ~/.aws/config
printenv AWS_CREDENTIALS > ~/.aws/credentials

export EXPERIMENT=llamaish7-normal-final
# export NCCL_IB_HCA=^mlx5_bond
export NCCL_DEBUG=TRACE
export NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1


torchrun \
--nnodes ${NUM_NODES}:${NUM_NODES} \
@@ -29,26 +38,21 @@ torchrun \
--node_rank=$BEAKER_REPLICA_RANK \
--rdzv_conf="read_timeout=420" \
scripts/train.py \
- configs/llamaish7-s3.yaml \
+ configs/llamaish7-normal-weka.yaml \
--run_name=$EXPERIMENT \
--wandb.name=$EXPERIMENT \
--wandb.group=$EXPERIMENT \
- --model.flash_attention=true \
--fsdp.wrapping_strategy=by_block_and_size \
--fsdp.sharding_strategy=SHARD_GRAD_OP \
--save_folder=runs/ \

Review comment (Member), on "--save_folder=runs/ \":
nit: I don't think there's any need to override this
Suggested change: remove this line.

--activation_checkpointing=fine_grained \
- --fused_loss=true \
--device_train_microbatch_size=2 \
--global_train_batch_size=1024 \

Review comment (Member), on "--global_train_batch_size=1024 \":
same here
Suggested change: remove this line.

--save_interval=250 \
--eval_interval=250 \
--optimizer.metrics_log_interval=1 \
--save_overwrite \
- --model.init_fn=normal \
- --model.init_std=0.02 \
- --model.clip_qkv=null \
--save_num_checkpoints_to_keep=3 \
- --scheduler.units=steps \
- --scheduler.t_warmup=2000
- # '--load_path=${path.last_checkpoint:s3://ai2-llm/checkpoints/OLMo-medium/llamaish7-normal/}'
+ --data.num_workers=64 \
+ '--load_path=${path.last_checkpoint:s3://ai2-llm/checkpoints/OLMo-medium/llamaish7-normal-final/}'
#--load_path=s3://ai2-llm/checkpoints/OLMo-medium/llamaish7-normal/step2000