Stabilize CUDA runners and jobs #829

Merged (3 commits) on Aug 10, 2022

@pavlovic-ivan (Contributor) commented Jul 25, 2022

This PR introduces AWS as a platform for the runners. It also fixes the test-library.runs-on issue so that runners no longer steal jobs that do not belong to them. In simple terms, the failure looked like this:

  • job A deploys runner A
  • job B deploys runner B
  • runner A runs job B
  • runner B runs job A
  • when job A finishes, destruction of runner A is triggered
  • runner A was still running job B, so once runner A is destroyed, job B shows up as "Canceled operation" in the workflow

The issue above is now fixed by the refactored test-library.runs-on labels.
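The job-stealing scenario can be illustrated with a small simulation. This is only a sketch: it relies on the GitHub Actions rule that a job may be picked up by any runner carrying all of the labels listed in runs-on, and the label names and the `unique_labels` helper are hypothetical, not the labels used in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    runs_on: frozenset  # labels the job requires


@dataclass(frozen=True)
class Runner:
    name: str
    labels: frozenset  # labels the runner registers with


def eligible_runners(job, runners):
    """Names of runners whose labels cover every label the job requires."""
    return [r.name for r in runners if job.runs_on <= r.labels]


def unique_labels(run_id, job_name):
    """Hypothetical naming scheme: one extra label per deployed runner."""
    return frozenset({"self-hosted", "cuda", f"{run_id}-{job_name}"})


# Before: both runners share the same labels, so either may take job B.
shared = frozenset({"self-hosted", "cuda"})
shared_runners = [Runner("runner-A", shared), Runner("runner-B", shared)]
job_b_shared = Job("job-B", shared)

# After: each runner carries a label that only its own job asks for.
unique_runners = [
    Runner("runner-A", unique_labels("run1", "job-A")),
    Runner("runner-B", unique_labels("run1", "job-B")),
]
job_b_unique = Job("job-B", unique_labels("run1", "job-B"))

stealable = eligible_runners(job_b_shared, shared_runners)
isolated = eligible_runners(job_b_unique, unique_runners)
```

With shared labels, `stealable` contains both runners, which is what allows runner A to run job B. With a per-job label, `isolated` contains only runner-B, so tearing down runner A can no longer cancel job B.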

Prerequisites:

  • updating/adding repo secrets
  • rebase from pavlovic-ivan/ephemeral-github-runner
  • rebase from pavlovic-ivan/ephemeral-github-runner-image (workflows triggered here can be canceled, images already exist in AWS)
  • rebase from pavlovic-ivan/ghrunner-app-gcp and redeploy the app

@jgiannuzzi @m4rs-mt we can do another session to set everything up

Working examples:

Secrets to add/update:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION
  • PULUMI_BACKEND_URL
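A quick pre-flight check for these secrets might look like the sketch below. It assumes the workflow exposes each secret to the job as an environment variable of the same name, which this PR does not state; the `missing_secrets` helper is hypothetical.

```python
import os

# Secret names taken from the list above.
REQUIRED_SECRETS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "PULUMI_BACKEND_URL",
)


def missing_secrets(env=None):
    """Return the required names that are absent or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        raise SystemExit("Missing secrets: " + ", ".join(missing))
    print("All required secrets are set.")
```

Running it as the first step of a job would fail fast with the list of missing names instead of failing later during runner deployment.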

@m4rs-mt merged commit 89e1a1d into m4rs-mt:master on Aug 10, 2022