[YUNIKORN-1040] add e2e test that re-starts the scheduler pod #369
Conversation
Force-pushed from 1d20074 to 44d7630.
Codecov Report

@@           Coverage Diff            @@
##           master     #369    +/-  ##
=======================================
  Coverage    65.41%   65.41%
=======================================
  Files           40       40
  Lines         6314     6314
=======================================
  Hits          4130     4130
  Misses        2023     2023
  Partials       161      161
=======================================

Continue to review the full report at Codecov.
Hi @yangwwei, do you have general pointers on how I can debug the failing CI checks? Running the e2e tests locally on this branch passes, but getting the applications in the pre-commit checks seems to fail.
Force-pushed from 44d7630 to 9650d01.
The scale up and scale down are way too nice to the scheduler. We want to kill the pod that runs the scheduler and then see if the pod gets rescheduled again. The admission controller should not be killed.
The fact that the scheduler runs as a deployment should cause the scheduler pod to be recreated and scheduled again.
We can follow these high level steps:
kClient.GetPodNamesFromNS("YK Namespace")
get the pod named "yunikorn-scheduler-*"
kClient.DeletePod("YK scheduler", "YK Namespace")
Even after that is done and the delete pod has returned without an error we should see a new scheduler pod in ready state within a short amount of time.
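The high-level steps above, translated to raw kubectl commands, might look like the sketch below. The namespace, label selector, and pod-name pattern are assumptions about a typical YuniKorn install, not taken from the test code, and `scheduler_pod` is a hypothetical helper that only greps the scheduler pod name out of a pod listing.

```shell
# Hypothetical helper: pick the pod named yunikorn-scheduler-* from a pod listing.
scheduler_pod() {
  grep -o 'yunikorn-scheduler-[a-z0-9-]*' | head -n 1
}

# Against a live cluster the flow would be (namespace/labels are assumptions):
#   pod=$(kubectl get pods -n yunikorn -o name | scheduler_pod)
#   kubectl delete pod "$pod" -n yunikorn
#   kubectl wait --for=condition=ready pod -l app=yunikorn \
#       -n yunikorn --timeout=60s
```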
Force-pushed from 9650d01 to 87d1195.
Thanks @wilfred-s, I incorporated your suggestions and ran the e2e tests locally with success. Not sure why the pre-commit checks seem to fail on getting the applications in CI. Is there any way I can get access to the cluster these run on to debug?
Sorry for the late response, I was on vacation last week so emails/notifications got backlogged. For this issue, @ronazhan can you share your thoughts? I am not entirely sure what is the best way to debug this. The error shown up there seems like the REST endpoint could not be accessed. Maybe this is because the scheduler pod gets restarted and we need to redo the port-forwarding? Here is how it was done during the initial setup: https://github.com/apache/incubator-yunikorn-k8shim/blob/a61fc0052c07e510503853db74030290bbda562b/scripts/run-e2e-tests.sh#L193-L199.
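Redoing that port-forward after a restart could be sketched as follows, assuming the forwarder runs in the background with its PID tracked in a file; the service name and ports in the commented usage are illustrative, not taken from the linked script.

```shell
# Kill any previous forwarder recorded in PIDFILE, then start a new one.
# Usage: restart_port_forward PIDFILE CMD [ARGS...]
restart_port_forward() {
  pidfile="$1"; shift
  if [ -f "$pidfile" ]; then
    kill "$(cat "$pidfile")" 2>/dev/null || true
  fi
  "$@" &
  echo $! > "$pidfile"
}

# Against a cluster (service name/ports are assumptions):
#   restart_port_forward /tmp/yk-pf.pid \
#       kubectl port-forward svc/yunikorn-service 9080:9080 -n yunikorn
```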
I don't know what happened with my last update; I must have closed a browser window without saving. I would also suggest that we move this restart test to its own test suite. Instead of making it part of basic scheduling and causing those tests to fail, we should create a "recovery & restart" suite, move this restart test in there, and then extend it with e2e recovery tests. Using some more advanced Ginkgo tricks we can make sure these restart or recovery tests run separately from all the others. Adding labels and splitting the Ginkgo run into two runs would allow us to be more destructive in tests. I do think that requires Ginkgo v2 with these test labels. @ronazhan, please share some input on the Ginkgo v2 change as well.
Force-pushed from aafa9e4 to 35dfd71.
Yes, @wilfred-s and @yangwwei are right that the port-forward connection needs to be restarted as well after the new scheduler pod is brought up. I don't believe port-forward is available via the k8s go-client library, so this might have to be done as a shell command. I haven't looked much into the Ginkgo v2 changes yet, but the recovery suite can be separated out.
Force-pushed from 90d42f3 to 2462ef9.
@yangwwei @wilfred-s @ronazhan Thank you for your inputs. I've applied most of your suggestions and the e2e tests are passing locally again. I've also verified that the port-forward process we add works locally. However, the CI checks are still failing; something is likely off with the port-forwarding logic I added in k8s_utils.
Force-pushed from 98884cf to 0fd6c5c.
Looks like the CI is failing after the recovery test suite; could it be that there is still something wrong with the port-forwarding logic? Can we collect both stdout and stderr there?
Force-pushed from dca07b2 to 467daa7.
@yangwwei Apologies for the relative inactivity on this PR, just getting back to it now. I addressed most of your concerns and am going with the singleton approach since I could not find a way to share context between different Ginkgo test suites (only within a suite). If we can verify that all suites execute in independent runtimes (which seems plausible from the log output of the checks), we don't even need to follow the singleton pattern, since each runtime would have its own port forwarder. With the changes I currently have, some checks intermittently fail due to not being able to get allocations after the scheduler restart. Many times all checks pass, but there is not a 100% pass rate when re-running them. I've printed some debug information which shows that we can see the sleep pod in the dev namespace in both cases, but the allocations are empty in the failing cases. You can search for "appsInfo allocations are" in the failing checks to verify this. Any thoughts on what could cause the empty allocations when querying the REST API?
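The singleton idea (start exactly one shared port forwarder, reuse it if it is already alive) can be approximated in shell with a PID file; this is illustrative only, since the real test framework would implement this in Go.

```shell
# Start CMD in the background only if no live process is recorded in PIDFILE.
# Usage: ensure_single PIDFILE CMD [ARGS...]
ensure_single() {
  pidfile="$1"; shift
  if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    return 0  # already running, reuse the existing forwarder
  fi
  "$@" &
  echo $! > "$pidfile"
}
```

Calling `ensure_single` a second time while the first process is alive is a no-op, which is exactly the property the singleton approach needs across test suites sharing one runtime.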
hi @anuraagnalluri thanks for the updates! Appreciated!
There are a couple of things:
@yangwwei Thanks for your input. I came to a similar realization that the intermittent failures occur in plugin mode. I did locally spin up the scheduler in plugin mode and ran the suite. Specifically, the suite passes and the allocations list is non-empty. I'll keep looking into this issue.
hi @anuraagnalluri could you please rebase your changes onto the latest master? YK is now an Apache TLP, so we have renamed our repos with some related code changes; hope those won't affect this PR.
Force-pushed from 60bdfd3 to 82302ba.
@yangwwei Done, and I changed the necessary imports. Thanks for getting another pair of eyes on this. I was able to reproduce the error locally a couple of times in plugin mode, but am still unsure why the allocations list is empty. When I ran into the same failure as we see in the CI checks, I was able to verify that the applicationID of the sleep-job pod belongs to the newly added application. Is it possible that plugin-mode logic could specifically affect this behavior in a way normal mode cannot?
@anuraagnalluri Just to clarify: the remaining failure is due to the test expecting a nil object representing an application's allocations, but instead receiving an allocation list of length 0? This seems like a minor, acceptable behavior change, but @yangwwei, can you help confirm this is now expected? Ideally, this should be consistent between the plugin and regular YuniKorn versions. For now, I would recommend adding an if-statement check to keep the test backwards compatible with previous versions: if the allocations object is not nil, validate that the allocations length is 0.
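The backwards-compatible check being recommended can be illustrated against the REST response with jq, treating a nil/missing allocations field the same as an empty list. This is a sketch: the `allocations` field name matches what the conversation describes, but the exact response shape is an assumption.

```shell
# Count allocations, treating a null or missing field as an empty list,
# so old (nil) and new (empty list) scheduler responses behave the same.
allocation_count() {
  jq '.allocations // [] | length'
}

echo '{"allocations": null}'       | allocation_count   # 0 (older behavior: nil)
echo '{"allocations": []}'         | allocation_count   # 0 (newer behavior: empty list)
echo '{"allocations": [{"id":1}]}' | allocation_count   # 1
```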
@ronazhan Appreciate your help! This is not quite the issue. The issue is that there is a sleep pod deployed in the development namespace here, so the test actually expects an allocation list of length 1. Most of the time this works and we can see the corresponding allocation. Sometimes, but only in plugin mode, the retrieved allocation list has length 0 despite the sleep pod being present and running in the dev namespace. This causes an error when trying to index the first element of the list. It's worth noting that the only difference between this and the basic_scheduling_test behavior is that we bounce the scheduler pod prior to making these checks. We want to ensure that restarting the scheduler won't drop allocations or affect any normal functionality, but this does not seem to be the case 100% of the time specifically in plugin mode.
There are a few things I can see that are issues with the test right away. For one thing, you've structured this as multiple integration tests when in fact there is really only one. All of the restart / re-forward behavior should go into the setup method and not be a separate IT test (Ginkgo doesn't guarantee test ordering, so the current structure is brittle even if it appears to execute correctly). More importantly, there's nothing I can see here that waits for scheduler recovery to complete. That's not a simple thing to do, but you might try scheduling a second pod (after the restart) and waiting for it to be running. Once that happens, it should be safe to query for the original pod.
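The "schedule a probe pod after the restart, then wait" idea can be sketched like this. The polling helper takes the status-reading command as arguments so the loop itself can be exercised without a cluster; the `recovery-probe` pod name and `dev` namespace in the commented usage are made up for illustration.

```shell
# Poll CMD (which prints a pod phase) until it reports WANT or TIMEOUT seconds pass.
# Usage: wait_for_phase WANT TIMEOUT CMD [ARGS...]
wait_for_phase() {
  want="$1"; timeout="$2"; shift 2
  start=$(date +%s)
  while [ "$("$@")" != "$want" ]; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1  # timed out before reaching the wanted phase
    fi
    sleep 1
  done
}

# Against a cluster (pod name/namespace are assumptions):
#   kubectl run recovery-probe --image=alpine --restart=Never -n dev -- sleep 30
#   wait_for_phase Running 60 \
#       kubectl get pod recovery-probe -n dev -o jsonpath='{.status.phase}'
#   # once the probe pod runs, the scheduler has recovered and it should be
#   # safe to query the original pod's allocations
```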
Force-pushed from 82302ba to cbe9b93.
A couple minor issues with timing.
Force-pushed from cbe9b93 to 39758ac.
@craigcondit Thanks a lot for your inputs. You're absolutely right about the multi-integration-test structure; I realize Ginkgo doesn't guarantee the order it runs its specs in, so I've moved the restart flow into the suite setup. I believe the checks are working now, and I'm willing to change the timeouts if you think they should diverge from the current values.
I think the timeout for the final pod status should probably be 30-60 seconds. We need to allow time for recovery, and that might take a little while. A normal pod with an already-running scheduler will probably be scheduled within 10 seconds reliably, but one on a just-started and not-yet-recovered scheduler may not.
Force-pushed from 39758ac to 7f4c2a5.
+1 LGTM. I'll merge this in shortly.
What is this PR for?
Add an e2e verification for restarting the scheduler pod, as discussed in YUNIKORN-1040. Added as a flow in the Ginkgo beforeSuite in basic_scheduling_test.go.
What type of PR is it?
Todos
What is the Jira issue?
How should this be tested?
Ran e2e test
Screenshots (if appropriate)
Questions: