Merging Develop -> Main for sample_workloads changes #366

Merged 54 commits on Mar 5, 2024

Chris113113 commented Mar 4, 2024

Upstreaming the following pull requests:

#350 #357 #361 #362 #351 #354 #364 #363

stevenBorisko and others added 30 commits December 15, 2023 13:05
…ltest/gke

Fix NCCL_SOCKET_IFNAME typo in values.yaml under sample_workloads/nccltest/gke
…tainer.sh

Replace `MODEL_NAME`, `GCS_EXPERIMENT_BUCKET`, and `EXPERIMENT_ROOT_DIR` with their environment variables.
Remove the manual launch and shutdown of the RxDM container, since users are expected to deploy the Slurm cluster with HPC Toolkit.
Add a note about setting `ulimit -n 1048576` when the orchestrator relies on SSH to launch processes that run communication patterns with send-recvs between many GPU pairs.
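
As an illustration of the `ulimit -n 1048576` note, the following is a minimal sketch (not part of this PR) of how a launcher could check the open-file limit before starting many send-recv processes. The helper names `limit_satisfies` and `nofile_status` are hypothetical, and the sketch assumes a Unix host where Python's standard `resource` module is available.

```python
import resource

# Value taken from the commit note above (`ulimit -n 1048576`).
REQUIRED_NOFILE = 1048576


def limit_satisfies(limit, required=REQUIRED_NOFILE):
    """Return True if an RLIMIT_NOFILE value is unlimited or at least `required`."""
    # Hypothetical helper: RLIM_INFINITY means "no limit", so check it first.
    return limit == resource.RLIM_INFINITY or limit >= required


def nofile_status(required=REQUIRED_NOFILE):
    """Return (soft, hard, ok) for the current process's open-file limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard, limit_satisfies(soft, required)


if __name__ == "__main__":
    soft, hard, ok = nofile_status()
    if ok:
        print(f"open-file limit is sufficient (soft={soft}, hard={hard})")
    else:
        print(
            f"open-file soft limit {soft} is below {REQUIRED_NOFILE}; "
            f"consider running `ulimit -n {REQUIRED_NOFILE}` before launching"
        )
```

The check only reports the limit for the process that runs it. Processes launched over SSH get their limits from the remote login session, so the same check would need to run on each node.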
Chris113113 merged commit 6ef47c4 into main Mar 5, 2024
3 checks passed
6 participants