
Removing assignment of service-account for launcher #1898

Closed
wants to merge 12 commits

Conversation

@rpemsel (Contributor) commented Sep 4, 2023

What this PR does / why we need it:

Which issue(s) this PR fixes

This PR fixes the following issue: #1897

Checklist:

  • Docs included if any changes are user facing

@google-cla bot commented Sep 4, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@rpemsel (Contributor Author) commented Sep 5, 2023

Are the tests maybe a bit flaky? Tests covering code I did not touch have failed; I don't think they are related to my changes at all.

@tenzen-y (Member) commented Sep 7, 2023

@rpemsel Thanks for creating this PR. The test failure doesn't seem to be due to flakiness.

#1905

Can you investigate the cause and fix the errors? Thanks.

@rpemsel (Contributor Author) commented Sep 8, 2023

Hi @tenzen-y,

thanks for your assessment regarding the flakiness of the tests, but I cannot share it. I invested quite a lot of time to make the tests work on my local machine with the changes from my pull request. For me the integration tests succeeded; those that failed did so because the underlying workloads (basic ML trainings) simply did not finish in time, since my machine is not powerful enough.

Because you can see the pods running before they are terminated, you can assume that the same thing happens on the GitHub runners.

Also, the errors in the GitHub tests are not related to my changes to the MPI jobs; the problem is in a test for the reconciler, which I did not touch. That test job is a TensorFlow (TF) job. Looking at the logs, it looks like a race condition between two tests: one of them seems to destroy the namespace, after which no further updates can happen.

I do not see it as my responsibility to fix the test stability of your application. How can we continue here?

@tenzen-y (Member) commented Sep 8, 2023

@rpemsel Thanks for the result of your investigation! This test failure may be caused by this branch missing a commit that is on the master branch. Can you rebase this PR? Thanks for your effort.

@google-oss-prow bot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rpemsel
Once this PR has been reviewed and has the lgtm label, please assign zw0610 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added size/S and removed size/XS labels Sep 8, 2023
@rpemsel (Contributor Author) commented Sep 8, 2023

@rpemsel Thanks for the result of your investigation! This test failure may be caused by this branch missing a commit that is on the master branch. Can you rebase this PR? Thanks for your effort.

I rebased on master, but apart from the Volcano gang scheduler, no other changes came in.

@johnugeorge (Member) commented:

Tests are failing consistently

@rpemsel (Contributor Author) commented Sep 8, 2023

Tests are failing consistently

Yes, they are; you could also say they are failing deterministically. I have already explained why they are failing. By the way, if you look at all the other pull requests, they are all failing too.

@tenzen-y (Member) commented Sep 8, 2023

If you look at all the other pull requests, they are all failing too.

Integration tests succeeded in #1905.

@rpemsel (Contributor Author) commented Sep 8, 2023

Integration tests succeeded in #1905.

Yes, tests are succeeding for one out of eleven pull requests. I don't think all the others are wrong.

Please run the tests locally on your system and you will see exactly the effects I described today. It would be much easier to debug the tests if they periodically output the logs of the deployed Kubernetes workloads, so one would have an idea of what is going wrong. Then it might even be possible to diagnose a failure just by looking at the output of the CI/CD pipeline.
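Editor's note: the kind of log dumping suggested above could look roughly like the following client-go sketch. The clientset and namespace are assumed to come from the test harness; none of this is taken from the repository itself.

```go
package e2eutil

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// DumpPodLogs prints the logs of every pod in the namespace so that a failed
// CI run leaves evidence about the workload, not just the test assertion.
func DumpPodLogs(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
	pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		req := clientset.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{})
		stream, err := req.Stream(ctx)
		if err != nil {
			fmt.Fprintf(os.Stderr, "could not stream logs for %s: %v\n", pod.Name, err)
			continue
		}
		fmt.Printf("---- logs for pod %s/%s ----\n", namespace, pod.Name)
		_, _ = io.Copy(os.Stdout, stream)
		_ = stream.Close()
	}
	return nil
}
```

Calling such a helper from the test teardown would surface the workload logs directly in the CI output.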

@tenzen-y (Member) commented Sep 8, 2023

I suspect that this change breaks the whole controller logic or the integration tests, since I hit the same error once I applied the same change as this PR on top of #1905.

@tenzen-y (Member) commented Sep 8, 2023

Also, integration tests succeeded in #1907.

@rpemsel (Contributor Author) commented Sep 11, 2023

I suspect that this change breaks the whole controller logic or the integration tests, since I hit the same error once I applied the same change as this PR on top of #1905.

I don't agree with this suspicion: if I run the e2e tests from the main branch locally, they still fail. Unfortunately I cannot run the tests from GitHub Actions myself.

Again I am asking: how can my change break PyTorch and MXJob jobs if it only touches MPIJobs? That does not make any sense to me.

Do you want me to send over the details of how to run the tests locally, so you can reproduce my results?

@rpemsel (Contributor Author) commented Sep 18, 2023

Sorry, but the tests are flaky/inconsistent in the CI/CD pipeline: this time all Go tests succeeded, whereas before a single test, in code I did not touch, always failed.

I increased the timeouts for the integration tests. They succeed locally but still fail in the CI/CD pipeline. Please verify for yourself.
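Editor's note: for context on what "increased the timeouts" typically means in Go integration tests like these, Gomega's Eventually accepts an optional timeout and polling interval. The snippet below is a self-contained illustration with made-up values and a placeholder condition, not the repository's actual test code.

```go
package e2e_test

import (
	"testing"
	"time"

	"github.com/onsi/gomega"
)

// TestEventuallyTimeoutExample shows where such timeouts live: the second and
// third arguments to Eventually. Bumping them is what "increasing the
// timeouts" amounts to; 30s / 250ms are illustrative values only.
func TestEventuallyTimeoutExample(t *testing.T) {
	g := gomega.NewWithT(t)

	jobSucceeded := func() bool {
		// In the real integration tests this would poll the cluster for the
		// job's status; here it simply returns true so the example passes.
		return true
	}

	g.Eventually(jobSucceeded, 30*time.Second, 250*time.Millisecond).Should(gomega.BeTrue())
}
```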

@andreyvelich (Member) commented Sep 18, 2023

Hi @rpemsel, can you please try to rebase your PR and remove the increased timeout for the E2Es?
Usually, if a test is stuck, a longer timeout won't resolve it.
Then we can verify whether your change somehow affects the MPIJob E2Es.

google-oss-prow bot added size/XS and removed size/S labels Sep 19, 2023
@coveralls commented:

Pull Request Test Coverage Report for Build 6231245371

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 9 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.1%) to 42.667%

Files with Coverage Reduction:
  • pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go: 9 new missed lines, 61.54% covered
Totals (Coverage Status):
  • Change from base Build 6204968357: -0.1%
  • Covered Lines: 3727
  • Relevant Lines: 8735

💛 - Coveralls

@rpemsel (Contributor Author) commented Sep 19, 2023

Hi @rpemsel, can you please try to rebase your PR and remove the increased timeout for the E2Es? Usually, if a test is stuck, a longer timeout won't resolve it. Then we can verify whether your change somehow affects the MPIJob E2Es.

I have rebased to the latest state of master and also reset the timeouts.

@@ -1035,7 +1035,6 @@ func (jc *MPIJobReconciler) newLauncher(mpiJob *kubeflowv1.MPIJob, kubectlDelive
 		jc.PodGroupControl.DecoratePodTemplateSpec(podSpec, mpiJob, rt)
 	}
 
-	podSpec.Spec.ServiceAccountName = launcherName
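Editor's note: the effect of dropping this assignment is that the launcher pod template ships with an empty ServiceAccountName, in which case Kubernetes falls back to the namespace's default ServiceAccount. A minimal sketch of that situation follows; the container spec is made up and this is not the operator's actual code.

```go
package example

import corev1 "k8s.io/api/core/v1"

// launcherPodSpec is illustrative only. With the assignment removed, the
// ServiceAccountName field stays empty, so the pod is admitted with the
// namespace's "default" ServiceAccount, which normally lacks the pods/exec
// permissions the MPI launcher needs (see the review comments below).
func launcherPodSpec(launcherName string) corev1.PodSpec {
	return corev1.PodSpec{
		Containers: []corev1.Container{
			{Name: "launcher", Image: "example.invalid/mpi-launcher:latest"},
		},
		// ServiceAccountName: launcherName, // the line this PR removes
	}
}
```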
Review comment (Member):

I just tried to build the Training Operator image with your change, and the MPIJob test failed for me:

pytest sdk/python/test/e2e/test_e2e_mpijob.py --log-cli-level=info --namespace=default -k "test_sdk_e2e"

I think we can't just remove the SA assignment for the MPIJob launcher.
The MPIJob launcher requires the appropriate RBAC to exec into and access the MPIJob worker pods.
That is why we attach the created ServiceAccount to the MPIJob launcher.

MPI Operator experts can comment on this @alculquicondor @tenzen-y @terrytangyuan
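Editor's note: to make the RBAC point concrete, here is a rough sketch of the kind of per-job ServiceAccount, Role, and RoleBinding the launcher depends on so that exec access to the worker pods is permitted. The names and the exact rule set are illustrative, not copied from the operator.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// launcherRBAC returns illustrative RBAC objects for a launcher ServiceAccount:
// read access to pods plus permission to create pods/exec, bound to the SA.
func launcherRBAC(namespace, launcherName string) (*corev1.ServiceAccount, *rbacv1.Role, *rbacv1.RoleBinding) {
	sa := &corev1.ServiceAccount{
		ObjectMeta: metav1.ObjectMeta{Name: launcherName, Namespace: namespace},
	}
	role := &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: launcherName, Namespace: namespace},
		Rules: []rbacv1.PolicyRule{
			{APIGroups: []string{""}, Resources: []string{"pods"}, Verbs: []string{"get", "list", "watch"}},
			{APIGroups: []string{""}, Resources: []string{"pods/exec"}, Verbs: []string{"create"}},
		},
	}
	binding := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: launcherName, Namespace: namespace},
		Subjects: []rbacv1.Subject{
			{Kind: rbacv1.ServiceAccountKind, Name: launcherName, Namespace: namespace},
		},
		RoleRef: rbacv1.RoleRef{APIGroup: rbacv1.GroupName, Kind: "Role", Name: launcherName},
	}
	return sa, role, binding
}
```

Without a ServiceAccount bound to rules like these, exec calls from the launcher into the workers would be denied.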

Review comment:

It doesn't sound right to remove the SA.

That said, we got rid of it in v2, because we don't use kubectl exec there.

google-oss-prow bot added size/S and removed size/XS labels Sep 20, 2023
@rpemsel
Copy link
Contributor Author

rpemsel commented Sep 20, 2023

I took up your reviews and avoided removing the service account. Instead, I implemented another solution, but I will need to open a new pull request, since there is a problem with the author identity I committed with, for which I will not be able to sign the CLA.

rpemsel closed this Sep 20, 2023
rpemsel deleted the launcher-service-account branch September 20, 2023 05:36

7 participants