
Release v1.4.2 #349

Merged (43 commits) on Dec 15, 2023

Changes from 1 commit
9676b57
main to develop (#325)
soumyapani Nov 6, 2023
c4f1240
Merging main to develop (#327)
soumyapani Nov 6, 2023
26e3b34
staging changes for litgpt
Chris113113 Nov 9, 2023
a76f635
Revert trainer code
Chris113113 Nov 9, 2023
67ef18b
Add microbatch params
Chris113113 Nov 9, 2023
d2935e1
Update to flash_attn functional commit
Chris113113 Nov 9, 2023
0557a9f
Set llama2 and 6,6 batch
Chris113113 Nov 9, 2023
d9d9132
Set precision
Chris113113 Nov 10, 2023
ec48223
Minor fixes in helm file
gkroiz Nov 10, 2023
06631a8
Bump flash-attn back to 2.0.4
Chris113113 Nov 11, 2023
64fc504
Add modelName param, remove mfu from output
Chris113113 Nov 11, 2023
2e2ba34
Remove readme for now
Chris113113 Nov 11, 2023
8c1adf3
LitGPT enhancements (new flash-attn) (#330)
Chris113113 Nov 11, 2023
d9d860e
Add litgpt readme
Chris113113 Nov 14, 2023
b31e9a8
Remove comments
Chris113113 Nov 14, 2023
959e60a
Address comments
Chris113113 Nov 14, 2023
26ba6d2
23.05 -> 23.09
Chris113113 Nov 14, 2023
b6dc51a
Add litgpt readme (#332)
Chris113113 Nov 14, 2023
8ed80a7
split gcsBucket var into two (data + exp Buckets)
gkroiz Nov 16, 2023
8ff8694
clone litgpt at d5d37 hash
gkroiz Nov 16, 2023
7d8188c
Small fixes + updates to README
gkroiz Nov 16, 2023
86d87bb
Clean empty lines
gkroiz Nov 16, 2023
dcddd5f
Fix "out" folder location so that it syncs to gcs
gkroiz Nov 17, 2023
088c14e
Remove extra line
gkroiz Nov 17, 2023
ef46259
Remove extra line
gkroiz Nov 17, 2023
6c5e3fe
Small fixes to Lit-GPT demo (#334)
gkroiz Nov 17, 2023
5a94575
remove profiling setup (currently not used)
gkroiz Nov 17, 2023
a3e72f2
remove profiling setup (currently not used) (#335)
gkroiz Nov 17, 2023
17bb362
Adding details to explain MFU calculation (#339)
parambole Nov 21, 2023
7c29031
Update rxdm image version
Chris113113 Dec 5, 2023
cf84951
Add readme update
Chris113113 Dec 5, 2023
846e81e
Update rxdm image version (#341)
Chris113113 Dec 5, 2023
606760a
Fix unsupported envvar are set for SLURM cluster #343 (#344)
parambole Dec 5, 2023
c74ba5a
Adding SLURM scripts to setup and launch lit-gpt training (#342)
parambole Dec 6, 2023
2cec121
Adding a simple Multi-Node Pingpong PyTorch Workload (#347)
parambole Dec 13, 2023
c9bf5fb
Set clustertype, fix cpu pinning
Chris113113 Dec 14, 2023
2f811f3
Pass num iters as param, fix gke CLUSTER_TYPE
Chris113113 Dec 14, 2023
31ded59
Whitespace fixes
Chris113113 Dec 14, 2023
804fa39
Fix sidecar termination, change warmup iters, move .example
Chris113113 Dec 14, 2023
77085c0
Update litgpt LKG, more params for injection (#348)
Chris113113 Dec 15, 2023
12c7ce5
Fix cleanup path
Chris113113 Dec 15, 2023
9696b8f
Add a nccl-test sample workload (#345)
Chris113113 Dec 15, 2023
9796582
Release v1.4.2
stevenBorisko Dec 15, 2023
Minor fixes in helm file
gkroiz committed Nov 10, 2023
commit ec4822381fc29f90be46221a39827d66aef6f0b3
4 changes: 2 additions & 2 deletions sample_workloads/lit-gpt-demo/helm/templates/litgpt.yaml

@@ -91,7 +91,7 @@ spec:
         - "bash"
         - "-c"
         - |
-          /tcpgpudmarxd/build/app/tcpgpudmarxd --gpu_nic_preset a3vm --gpu_shmem_type fd --setup_param "--verbose 128 5 0" --uds_path "/run/tcpx" &
+          /tcpgpudmarxd/build/app/tcpgpudmarxd --gpu_nic_preset a3vm --gpu_shmem_type fd --setup_param "--verbose 128 5 0" &
           while [ ! -e "/usr/share/litgpt/workload_terminated" ]; do sleep 10; done
       securityContext:
         privileged: true
@@ -150,7 +150,7 @@ spec:
         - name: EXPERIMENT_ROOT_DIR
           value: "{{$root.Values.workload.experimentDir}}"
         - name: DATA_DIR
-          value: "{{$root.Values.workload.experimentDir}}"
+          value: "{{$root.Values.workload.dataDir}}"
         - name: BATCH_SIZE
           value: "{{$root.Values.workload.batchSize}}"
         - name: MICRO_BATCH_SIZE