XPK cleanup: integ tests and code cleanup #121

Open
wants to merge 7 commits into
base: main
12 changes: 3 additions & 9 deletions .github/workflows/build_tests.yaml
@@ -21,14 +21,14 @@ on:

env:
# Names must be unique in parallel running tests.
- TPU_CLUSTER_NAME: build-xpk-2-v4-8-nodepools
+ TPU_CLUSTER_NAME: build-xpk-2-v4-8-nodepools-${{ github.run_id }}-${{ github.run_attempt }}
Collaborator

run_attempt may uniquely identify the cluster name. Do you still think run_id is needed? https://stackoverflow.com/questions/54310050/how-to-version-build-artifacts-using-github-actions
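
To make the question above concrete, here is a minimal sketch of how the two naming schemes behave. It relies on the documented behaviour that `github.run_id` is fixed per workflow run and `github.run_attempt` starts at 1 and increments on each re-run. The run IDs and helper names below are hypothetical, chosen only for illustration.

```python
# Hypothetical (run_id, run_attempt) pairs: two distinct workflow runs,
# each executed once and then re-run once.
RUNS = [
    (8701234567, 1),
    (8701234567, 2),
    (8709999999, 1),
    (8709999999, 2),
]

PREFIX = "build-xpk-2-v4-8-nodepools"


def name_with_attempt_only(prefix, run_attempt):
    """Cluster name suffixed with run_attempt only."""
    return f"{prefix}-{run_attempt}"


def name_with_run_id_and_attempt(prefix, run_id, run_attempt):
    """Cluster name suffixed with run_id and run_attempt, as in this PR."""
    return f"{prefix}-{run_id}-{run_attempt}"


# Collect the distinct names each scheme produces across all four executions.
attempt_only = {name_with_attempt_only(PREFIX, a) for _, a in RUNS}
combined = {name_with_run_id_and_attempt(PREFIX, r, a) for r, a in RUNS}

print(len(attempt_only), "distinct names with run_attempt only")
print(len(combined), "distinct names with run_id and run_attempt")
```

Under these assumptions the attempt-only scheme reuses names across different runs, while the combined suffix stays distinct for every execution. Whether that matters in practice depends on whether two runs can overlap, which the concurrency group is meant to control.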

WORKLOAD_NAME: xpktest-build-${{ github.run_attempt }}
PATHWAYS_WORKLOAD_NAME: xpkpw-build-${{ github.run_attempt }}

jobs:
- cluster-create-and-delete:
+ tpu-cluster-workload-workflow:
runs-on: [ubuntu-20.04]
- concurrency: # We support one build or nightly test to run at a time currently.
Collaborator

@RoshaniN, Apr 17, 2024

Curious to know how "one of build or nightly" is ensured. I think job concurrency is limited to one run per group.
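
A rough model of the per-group behaviour raised in this comment, assuming only what the GitHub Actions `concurrency` documentation states: jobs sharing a group name are serialized, and jobs in differently named groups do not block each other. The scheduler below is a toy illustration, not GitHub's implementation. The group names are the ones that appear in this diff.

```python
def can_start(running_groups, group):
    """A job may start only if no job in the same concurrency group is running."""
    return group not in running_groups


# Jobs arriving while earlier ones are still running (hypothetical sequence).
incoming = [
    ("build", "build-test-cluster-group"),
    ("nightly", "nightly-test-cluster-group"),
    ("build-2", "build-test-cluster-group"),
]

running_groups = set()
started = []
pending = []

for job, group in incoming:
    if can_start(running_groups, group):
        running_groups.add(group)
        started.append(job)
    else:
        # With cancel-in-progress: false, the new job waits for the running one.
        pending.append(job)

print("started:", started)
print("pending:", pending)
```

In this model the build and nightly jobs run side by side because their group names differ, and only the second build job waits. Mutual exclusion between build and nightly would need a shared group name, which is consistent with the revised comment dropping "or nightly".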

+ concurrency: # We support one build test to run at a time currently.
group: build-test-cluster-group
cancel-in-progress: false
steps:
@@ -70,9 +70,3 @@ jobs:
- name: Delete the cluster created
if: always()
run: python xpk.py cluster delete --cluster $TPU_CLUSTER_NAME --zone=us-central2-b






97 changes: 83 additions & 14 deletions .github/workflows/nightly_tests.yaml
@@ -21,16 +21,16 @@ on:

env:
# Names must be unique in parallel running tests.
- EMPTY_CLUSTER_NAME: nightly-xpk-zero-nodepools
- TPU_CLUSTER_NAME: nightly-xpk-2-v4-8-nodepools
+ EMPTY_CLUSTER_NAME: nightly-xpk-zero-nodepools-${{ github.run_id }}-${{ github.run_attempt }}
+ TPU_CLUSTER_NAME: nightly-xpk-2-v4-8-nodepools-${{ github.run_id }}-${{ github.run_attempt }}
+ PATHWAYS_TPU_CLUSTER_NAME: pw-nightly-test-2-v4-8-nodepools-${{ github.run_id }}-${{ github.run_attempt }}
+ AUTOPROVISION_CLUSTER_NAME: autoprovision-nightly-test
  WORKLOAD_NAME: xpktest-nightly-${{ github.run_attempt }}
- PATHWAYS_TPU_CLUSTER_NAME: pw-nightly-test-2-v4-8-nodepools
  PATHWAYS_WORKLOAD_NAME: xpkpw-nightly-${{ github.run_attempt }}

jobs:
- cluster-create-and-delete:
+ tpu-cluster-workload-workflow:
  runs-on: [ubuntu-20.04]
- concurrency: # We support one build test to run at a time currently.
+ concurrency: # We support one build per job to run at a time currently.
group: nightly-test-cluster-group
cancel-in-progress: false
steps:
@@ -71,7 +71,83 @@ jobs:
- name: Delete the cluster created
if: always()
run: python xpk.py cluster delete --cluster $TPU_CLUSTER_NAME --zone=us-central2-b

+ command-help-test:
+ runs-on: [ubuntu-20.04]
+ concurrency: # We support one build test to run at a time currently.
+ group: nightly-command-help-test-cluster-group
+ cancel-in-progress: false
+ steps:
+ - uses: actions/checkout@v4
+ - uses: actions/setup-python@v5
+ with:
+ python-version: '3.10'
+ - uses: 'google-github-actions/auth@v2'
+ with:
+ credentials_json: '${{ secrets.GCP_SA_KEY }}'
+ - uses: google-github-actions/setup-gcloud@v2
+ with:
+ version: '>= 363.0.0'
+ install_components: 'beta,gke-gcloud-auth-plugin'
+ - name: Verify gcp setup
+ run: gcloud info
+ - name: XPK Help
+ run: python3 xpk.py --help
+ - name: XPK Cluster Help
+ run: python3 xpk.py cluster --help
+ - name: XPK Cluster Create Help
+ run: python3 xpk.py cluster create --help
+ - name: XPK Cluster Delete Help
+ run: python3 xpk.py cluster delete --help
+ - name: XPK Cluster Describe Help
+ run: python3 xpk.py cluster describe --help
+ - name: XPK Workload Help
+ run: python3 xpk.py workload --help
+ - name: XPK Workload Create Help
+ run: python3 xpk.py workload create --help
+ - name: XPK Workload Delete Help
+ run: python3 xpk.py workload delete --help
+ - name: XPK Workload List Help
+ run: python3 xpk.py workload list --help
+ - name: XPK Inspector Help
+ run: python3 xpk.py inspector --help
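
The ten help steps above follow one pattern, so they could also be generated rather than listed by hand. The sketch below is only an illustration of that idea: the `SUBCOMMANDS` table and the `help_commands` helper are hypothetical names that mirror the commands exercised in this job, and are not part of XPK.

```python
import sys

# Command groups and their actions, copied from the help steps in this job.
SUBCOMMANDS = {
    "cluster": ["create", "delete", "describe"],
    "workload": ["create", "delete", "list"],
    "inspector": [],
}


def help_commands(script="xpk.py"):
    """Build one argv list per --help invocation, top-level command first."""
    commands = [[sys.executable, script, "--help"]]
    for group, actions in SUBCOMMANDS.items():
        commands.append([sys.executable, script, group, "--help"])
        for action in actions:
            commands.append([sys.executable, script, group, action, "--help"])
    return commands


commands = help_commands()
for argv in commands:
    # Print the command without the interpreter path for readability.
    print(" ".join(argv[1:]))
```

Each argv list could then be passed to `subprocess.run(argv, check=True)` from a single workflow step, so that a non-zero exit from any help command fails the job. Separate steps, as in the diff, have the advantage of showing which command failed directly in the Actions UI.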
+ xpk-tpu-autoprovisioning-test:
+ runs-on: [ubuntu-20.04]
+ concurrency: # We support one build test to run at a time currently.
+ group: nightly-autoprovisioning-test-cluster-group
+ cancel-in-progress: false
+ steps:
+ - uses: actions/checkout@v4
+ - uses: actions/setup-python@v5
+ with:
+ python-version: '3.10'
+ - uses: 'google-github-actions/auth@v2'
+ with:
+ credentials_json: '${{ secrets.GCP_SA_KEY }}'
+ - uses: google-github-actions/setup-gcloud@v2
+ with:
+ version: '>= 363.0.0'
+ install_components: 'beta,gke-gcloud-auth-plugin'
+ - name: Create an autoprovisioning-enabled XPK cluster with 2x v4-8 nodepools
+ run: python xpk.py cluster create --cluster $AUTOPROVISION_CLUSTER_NAME --enable-autoprovisioning --device-type=v4-8 --num-slices=2 --zone=us-central2-b --default-pool-cpu-machine-type=n1-standard-16 --reservation='${{ secrets.GCP_TPU_V4_RESERVATION }}' --custom-cluster-arguments='${{ secrets.CLUSTER_ARGUMENTS }}'
+ - name: Authenticate Docker
+ run: gcloud auth configure-docker --quiet
+ - name: Create test script to execute in workloads
+ run: echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh
+ - name: Run a 2x v4-8 workload on Ubuntu base image
+ run: python xpk.py workload create --cluster $AUTOPROVISION_CLUSTER_NAME --workload $WORKLOAD_NAME --tpu-type=v4-8 --num-slices=2 --zone=us-central2-b --command "bash test.sh"
+ - name: Wait for 2x v4-8 workload completion and confirm it succeeded
+ run: python3 xpk.py workload list --cluster $AUTOPROVISION_CLUSTER_NAME --zone=us-central2-b --wait-for-job-completion $WORKLOAD_NAME --timeout 300
+ - name: Run a 1x v4-16 workload
+ run: python xpk.py workload create --cluster $AUTOPROVISION_CLUSTER_NAME --workload ${WORKLOAD_NAME}-v4-16 --tpu-type=v4-16 --num-slices=1 --zone=us-central2-b --command "bash test.sh"
+ - name: Wait for 1x v4-16 workload completion and confirm it succeeded. Give 20 minutes to allow the node pools to re-provision.
+ run: python3 xpk.py workload list --cluster $AUTOPROVISION_CLUSTER_NAME --zone=us-central2-b --wait-for-job-completion ${WORKLOAD_NAME}-v4-16 --timeout 1200
+ - name: Delete the 2x v4-8 workload on the cluster
+ run: python3 xpk.py workload delete --workload $WORKLOAD_NAME --cluster $AUTOPROVISION_CLUSTER_NAME --zone=us-central2-b
+ - name: Delete the 1x v4-16 workload on the cluster
+ run: python3 xpk.py workload delete --workload ${WORKLOAD_NAME}-v4-16 --cluster $AUTOPROVISION_CLUSTER_NAME --zone=us-central2-b
+ - name: Delete the autoprovisioned cluster created
+ if: always()
+ run: python xpk.py cluster delete --cluster $AUTOPROVISION_CLUSTER_NAME --zone=us-central2-b
pw-cluster-and-workload:
runs-on: [ubuntu-20.04]
concurrency: # We support one build test to run at a time currently.
@@ -102,10 +178,3 @@ jobs:
- name: Delete the Pathways cluster created
if: always()
run: python xpk.py cluster delete --cluster $PATHWAYS_TPU_CLUSTER_NAME --zone=us-central2-b






