
Removing assignment of service-account for launcher #1898

Closed
wants to merge 12 commits

Conversation

@rpemsel (Contributor) commented Sep 4, 2023

What this PR does / why we need it:

Which issue(s) this PR fixes

This PR fixes the following issue: #1897

Checklist:

  • Docs included if any changes are user facing

@google-cla bot commented Sep 4, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@rpemsel (Contributor Author) commented Sep 5, 2023

Are the tests maybe a bit flaky? Tests covering code I did not touch have failed; I don't think they are related to my changes at all.

@tenzen-y (Member) commented Sep 7, 2023

@rpemsel Thanks for creating this PR. The test failure doesn't seem to be due to flakiness.

#1905

Can you investigate the cause and fix the errors? Thanks.

@rpemsel (Contributor Author) commented Sep 8, 2023

Hi @tenzen-y,

thanks for your assessment regarding the flakiness of the tests, but I cannot share it. I invested quite a lot of time to make the tests work on my local machine with the changes from my pull request. For me the integration tests succeeded; those that failed did so because the underlying workloads (basic ML trainings) simply did not finish in time, since my machine is not powerful enough.

Because you can see the pods running before they are terminated, you can assume that the same thing happens on the GitHub runners.

Also, the errors in the GitHub tests are not related to my changes to the MPI jobs; the problem is in a test for the reconciler, which I did not touch. That test job is a TensorFlow (TF) job. Looking at the logs, it looks like a race condition between two tests: one of them seems to destroy the namespace, after which no further updates can happen.

I do not see it as my responsibility to fix the test stability of your application. How can we continue here?

@tenzen-y (Member) commented Sep 8, 2023

@rpemsel Thanks for the result of your investigation! This test failure may be caused by this branch missing a commit that is on the master branch. Can you rebase this PR? Thanks for your effort.

@google-oss-prow bot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rpemsel
Once this PR has been reviewed and has the lgtm label, please assign zw0610 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added size/S and removed size/XS labels Sep 8, 2023
@rpemsel (Contributor Author) commented Sep 8, 2023

@rpemsel Thanks for the result of your investigation! This test failure may be caused by this branch missing a commit that is on the master branch. Can you rebase this PR? Thanks for your effort.

I rebased on master, but apart from the Volcano gang scheduler, no other changes came in.

@johnugeorge (Member) commented:

Tests are failing consistently

@rpemsel (Contributor Author) commented Sep 8, 2023

Tests are failing consistently

Yes, they are; you could also say they are failing deterministically. I have already explained why they are failing. By the way, if you look at all the other pull requests, they are all failing too.

@tenzen-y (Member) commented Sep 8, 2023

If you look at all the other pull requests, they are all failing too.

Integration tests succeeded in #1905.

@rpemsel (Contributor Author) commented Sep 8, 2023

Integration tests succeeded in #1905.

Yes, tests are succeeding for one out of eleven pull requests. I don't think all the others are wrong.

Please run the tests locally on your system and you will see exactly the effects I described today. It would be much easier to debug the tests if they periodically output the logs of the deployed Kubernetes workloads, so one would have an idea of what is going wrong. Then it might even be possible to diagnose a failure just by looking at the output of the CI/CD pipeline.
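Editor's note: the kind of log dumping suggested above could look roughly like the following client-go sketch. The clientset and namespace are assumed to come from the test harness; none of this is taken from the repository itself.

```go
package e2eutil

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// DumpPodLogs prints the logs of every pod in the namespace so that a failed
// CI run leaves evidence about the workload, not just the test assertion.
func DumpPodLogs(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
	pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		req := clientset.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{})
		stream, err := req.Stream(ctx)
		if err != nil {
			fmt.Fprintf(os.Stderr, "could not stream logs for %s: %v\n", pod.Name, err)
			continue
		}
		fmt.Printf("---- logs for pod %s/%s ----\n", namespace, pod.Name)
		_, _ = io.Copy(os.Stdout, stream)
		_ = stream.Close()
	}
	return nil
}
```

Calling such a helper from the test teardown would surface the workload logs directly in the CI output.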

@tenzen-y (Member) commented Sep 8, 2023

I suspect that this change breaks the whole controller logic or the integration tests, since I hit the same error once I applied the same change as this PR on top of #1905.

@tenzen-y (Member) commented Sep 8, 2023

Also, integration tests succeeded in #1907.

@rpemsel (Contributor Author) commented Sep 11, 2023

I suspect that this change breaks the whole controller logic or the integration tests, since I hit the same error once I applied the same change as this PR on top of #1905.

I don't agree with this suspicion: if I run the e2e tests from the main branch locally, they still fail. Unfortunately I cannot run the tests from GitHub Actions myself.

Again I am asking: how can my change break PyTorch and MXJob jobs if it only touches MPIJobs? That does not make any sense to me.

Do you want me to send over the details of how to run the tests locally, so you can reproduce my results?

@rpemsel (Contributor Author) commented Sep 18, 2023

Sorry, but the tests are flaky/inconsistent in the CI/CD pipeline: this time all Go tests succeeded, whereas before a single test, in code I did not touch, always failed.

I increased the timeouts for the integration tests. They succeed locally but still fail in the CI/CD pipeline. Please verify for yourself.
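Editor's note: for context on what "increased the timeouts" typically means in Go integration tests like these, Gomega's Eventually accepts an optional timeout and polling interval. The snippet below is a self-contained illustration with made-up values and a placeholder condition, not the repository's actual test code.

```go
package e2e_test

import (
	"testing"
	"time"

	"github.com/onsi/gomega"
)

// TestEventuallyTimeoutExample shows where such timeouts live: the second and
// third arguments to Eventually. Bumping them is what "increasing the
// timeouts" amounts to; 30s / 250ms are illustrative values only.
func TestEventuallyTimeoutExample(t *testing.T) {
	g := gomega.NewWithT(t)

	jobSucceeded := func() bool {
		// In the real integration tests this would poll the cluster for the
		// job's status; here it simply returns true so the example passes.
		return true
	}

	g.Eventually(jobSucceeded, 30*time.Second, 250*time.Millisecond).Should(gomega.BeTrue())
}
```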

@andreyvelich (Member) commented Sep 18, 2023

Hi @rpemsel, can you please try to rebase your PR and remove the increased timeout for the E2Es?
Usually, if a test is stuck, a longer timeout won't resolve it.
Then we can verify whether your change somehow affects the MPIJob E2Es.

google-oss-prow bot added size/XS and removed size/S labels Sep 19, 2023
@coveralls commented:

Pull Request Test Coverage Report for Build 6231245371

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 9 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.1%) to 42.667%

Files with Coverage Reduction:
  • pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go: 9 new missed lines, 61.54% covered
Totals (Coverage Status):
  • Change from base Build 6204968357: -0.1%
  • Covered Lines: 3727
  • Relevant Lines: 8735

💛 - Coveralls

@rpemsel (Contributor Author) commented Sep 19, 2023

Hi @rpemsel, can you please try to rebase your PR and remove the increased timeout for the E2Es? Usually, if a test is stuck, a longer timeout won't resolve it. Then we can verify whether your change somehow affects the MPIJob E2Es.

I have rebased to the latest state of master and also reset the timeouts.

@@ -1035,7 +1035,6 @@ func (jc *MPIJobReconciler) newLauncher(mpiJob *kubeflowv1.MPIJob, kubectlDelive
 		jc.PodGroupControl.DecoratePodTemplateSpec(podSpec, mpiJob, rt)
 	}
 
-	podSpec.Spec.ServiceAccountName = launcherName
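Editor's note: the effect of dropping this assignment is that the launcher pod template ships with an empty ServiceAccountName, in which case Kubernetes falls back to the namespace's default ServiceAccount. A minimal sketch of that situation follows; the container spec is made up and this is not the operator's actual code.

```go
package example

import corev1 "k8s.io/api/core/v1"

// launcherPodSpec is illustrative only. With the assignment removed, the
// ServiceAccountName field stays empty, so the pod is admitted with the
// namespace's "default" ServiceAccount, which normally lacks the pods/exec
// permissions the MPI launcher needs (see the review comments below).
func launcherPodSpec(launcherName string) corev1.PodSpec {
	return corev1.PodSpec{
		Containers: []corev1.Container{
			{Name: "launcher", Image: "example.invalid/mpi-launcher:latest"},
		},
		// ServiceAccountName: launcherName, // the line this PR removes
	}
}
```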
Review comment (Member):

I just tried to build the Training Operator image with your change, and the MPIJob test failed for me:

pytest sdk/python/test/e2e/test_e2e_mpijob.py --log-cli-level=info --namespace=default -k "test_sdk_e2e"

I think we can't just remove the SA assignment for the MPIJob launcher.
The MPIJob launcher requires the appropriate RBAC to exec into and access the MPIJob worker pods.
That is why we attach the created ServiceAccount to the MPIJob launcher.

MPI Operator experts can comment on this @alculquicondor @tenzen-y @terrytangyuan
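Editor's note: to make the RBAC point concrete, here is a rough sketch of the kind of per-job ServiceAccount, Role, and RoleBinding the launcher depends on so that exec access to the worker pods is permitted. The names and the exact rule set are illustrative, not copied from the operator.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// launcherRBAC returns illustrative RBAC objects for a launcher ServiceAccount:
// read access to pods plus permission to create pods/exec, bound to the SA.
func launcherRBAC(namespace, launcherName string) (*corev1.ServiceAccount, *rbacv1.Role, *rbacv1.RoleBinding) {
	sa := &corev1.ServiceAccount{
		ObjectMeta: metav1.ObjectMeta{Name: launcherName, Namespace: namespace},
	}
	role := &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: launcherName, Namespace: namespace},
		Rules: []rbacv1.PolicyRule{
			{APIGroups: []string{""}, Resources: []string{"pods"}, Verbs: []string{"get", "list", "watch"}},
			{APIGroups: []string{""}, Resources: []string{"pods/exec"}, Verbs: []string{"create"}},
		},
	}
	binding := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: launcherName, Namespace: namespace},
		Subjects: []rbacv1.Subject{
			{Kind: rbacv1.ServiceAccountKind, Name: launcherName, Namespace: namespace},
		},
		RoleRef: rbacv1.RoleRef{APIGroup: rbacv1.GroupName, Kind: "Role", Name: launcherName},
	}
	return sa, role, binding
}
```

Without a ServiceAccount bound to rules like these, exec calls from the launcher into the workers would be denied.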

Review comment:

It doesn't sound right to remove the SA.

That said, we got rid of it in v2, because we don't use kubectl exec there.

google-oss-prow bot added size/S and removed size/XS labels Sep 20, 2023
@rpemsel
Copy link
Contributor Author

rpemsel commented Sep 20, 2023

I took up your reviews and avoided removing the service account. Instead, I implemented another solution, but I will need to open a new pull request, since there is a problem with the author identity I committed with, for which I will not be able to sign the CLA.

rpemsel closed this Sep 20, 2023
rpemsel deleted the launcher-service-account branch September 20, 2023 05:36

7 participants