
[YUNIKORN-1040] add e2e test that re-starts the scheduler pod #369

Closed
wants to merge 1 commit

Conversation

@anuraagnalluri (Contributor) commented Feb 14, 2022

What is this PR for?

Add an e2e verification for restarting the scheduler pod as discussed in YUNIKORN-1040. Added as a flow in the ginkgo BeforeSuite in basicscheduling_test.go.

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

What is the Jira issue?

How should this be tested?

Ran e2e test

Screenshots (if appropriate)

Questions:

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 4 times, most recently from 1d20074 to 44d7630 on February 14, 2022 at 23:29
codecov bot commented Feb 14, 2022

Codecov Report

Merging #369 (cbe9b93) into master (d618415) will not change coverage.
The diff coverage is n/a.

❗ Current head cbe9b93 differs from pull request most recent head 7f4c2a5. Consider uploading reports for the commit 7f4c2a5 to get more accurate results

@@           Coverage Diff           @@
##           master     #369   +/-   ##
=======================================
  Coverage   65.41%   65.41%           
=======================================
  Files          40       40           
  Lines        6314     6314           
=======================================
  Hits         4130     4130           
  Misses       2023     2023           
  Partials      161      161           


@anuraagnalluri (Contributor, Author)

Hi @yangwwei, do you have any general pointers on how I can debug the failing CI checks? Running the e2e tests locally on this branch passes, but fetching the applications seems to fail in the pre-commit checks.

@wilfred-s (Contributor) left a comment

The scale up and scale down are way too nice to the scheduler. We want to kill the pod that runs the scheduler and then see whether the pod gets rescheduled again. The admission controller should not be killed.
The fact that the scheduler runs as a deployment should cause the scheduler pod to be recreated and scheduled again.
We can follow these high-level steps:

  1. kClient.GetPodNamesFromNS("YK Namespace")
  2. Get the pod named "yunikorn-scheduler-*"
  3. kClient.DeletePod("YK scheduler", "YK Namespace")

Even after the delete call has returned without an error, we should see a new scheduler pod in the ready state within a short amount of time.
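
A minimal Go sketch of that flow, assuming kClient is the suite's Kubernetes helper and gomega's Ω is dot-imported as in the existing suites; the helper names, namespace, pod-name prefix and label selector below are assumptions rather than the framework's exact API:

    // Sketch only: helper names, the "yunikorn" namespace and the label
    // selector are placeholders; the real e2e helpers may differ.
    ykNamespace := "yunikorn"

    podNames, err := kClient.GetPodNamesFromNS(ykNamespace)
    Ω(err).NotTo(HaveOccurred())

    // Find the scheduler pod; the admission-controller pod is left alone.
    var schedulerPod string
    for _, name := range podNames {
        if strings.HasPrefix(name, "yunikorn-scheduler-") {
            schedulerPod = name
            break
        }
    }
    Ω(schedulerPod).NotTo(BeEmpty())

    // Kill the scheduler pod; the deployment should recreate it.
    err = kClient.DeletePod(schedulerPod, ykNamespace)
    Ω(err).NotTo(HaveOccurred())

    // Wait for the replacement pod to come back up in ready state.
    err = kClient.WaitForPodBySelectorRunning(ykNamespace, "app=yunikorn-scheduler", 60)
    Ω(err).NotTo(HaveOccurred())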

@anuraagnalluri (Contributor, Author)

Thanks @wilfred-s, I incorporated your suggestions and ran the e2e tests locally with success. I'm not sure why the pre-commit checks fail on getting the applications in CI. Is there any way I can get access to the cluster these run on to debug?

@yangwwei (Contributor)

Sorry for the late response; I was on vacation last week, so emails/notifications got backlogged. For this issue, @ronazhan, can you share your thoughts?

I am not entirely sure what the best way to debug this is. The error shown above:

Unexpected error:
      <*url.Error | 0xc0004c0210>: {
          Op: "Get",
          URL: "http://localhost:9080/ws/v1/apps",
          Err: {s: "EOF"},
      }
      Get "http://localhost:9080/ws/v1/apps": EOF
  occurred

That suggests the REST endpoint could not be accessed. Maybe this is because the scheduler pod gets restarted and we need to redo the port-forwarding? Here is how it was done during the initial setup: https://github.com/apache/incubator-yunikorn-k8shim/blob/a61fc0052c07e510503853db74030290bbda562b/scripts/run-e2e-tests.sh#L193-L199.

@wilfred-s (Contributor)

I don't know what happened with my last update; I must have closed a browser window without saving.
I think @yangwwei is right: the port-forward has not been recreated after the restart, and that will cause issues for the rest of the tests.

I would also suggest that we move this restart test into its own test suite. Instead of making it part of basic scheduling and causing those tests to fail, we should create a "recovery & restart" suite, move the restart test in there, and then extend it with e2e recovery tests.

Using some more advanced ginkgo tricks we can then make sure these restart or recovery tests run separately from all the others. Adding labels and splitting the ginkgo run into two runs would allow us to be more destructive in tests. I do think that requires ginkgo v2 with its test labels. @ronazhan, please give some input on the ginkgo v2 change as well.

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 3 times, most recently from aafa9e4 to 35dfd71 on February 26, 2022 at 03:53
@ronazhan (Contributor)

Yes, @wilfred-s and @yangwwei are right that the port-forward connection needs to be restarted as well after the new scheduler pod is brought up. I don't believe port-forwarding is available through the k8s go-client library, so this might have to be done as a shell command.

I haven't looked much into the ginkgo v2 changes, but the recovery suite can be separated by using the --skip-package flag. Then it can be run separately from the other test packages.

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 3 times, most recently from 90d42f3 to 2462ef9 on February 26, 2022 at 09:30
@anuraagnalluri (Contributor, Author) commented Feb 26, 2022

@yangwwei @wilfred-s @ronazhan Thank you for your input. I've applied most of your suggestions and the e2e tests are passing locally again. I've also verified that the port-forward process we add in the AfterSuite is running and functional after test execution.

However, the CI checks are still failing. Something is likely off with the port-forwarding logic I added in k8s_utils.

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 3 times, most recently from 98884cf to 0fd6c5c on February 27, 2022 at 21:33
@yangwwei (Contributor)

Looks like the CI is failing after the recovery test suite; could it be that there is still something wrong with the port-forwarding logic? Can we collect both stdout and stderr in PortForwardService while executing exec.Command, e.g. using https://pkg.go.dev/os/exec#Cmd.CombinedOutput, to get more info?
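
For reference, a hedged sketch of one way to capture both streams. Since kubectl port-forward never exits on its own, CombinedOutput (which waits for the command to finish) would block forever, so the sketch merges stdout and stderr into one buffer that can be dumped whenever the forward dies. The package name, function name, service name and port mapping are placeholders, not the actual values from run-e2e-tests.sh or k8s_utils:

    package k8s

    import (
        "bytes"
        "fmt"
        "os/exec"
    )

    // startPortForward starts `kubectl port-forward` asynchronously and merges its
    // stdout and stderr into a single buffer, giving the same view as
    // CombinedOutput but for a process that is expected to keep running.
    func startPortForward(namespace string) (*exec.Cmd, *bytes.Buffer, error) {
        // NOTE: service name and port mapping below are placeholders.
        cmd := exec.Command("kubectl", "port-forward",
            "svc/yunikorn-service", "9080:9080", "-n", namespace)

        var out bytes.Buffer
        cmd.Stdout = &out
        cmd.Stderr = &out

        if err := cmd.Start(); err != nil {
            return nil, nil, fmt.Errorf("failed to start port-forward: %w", err)
        }
        return cmd, &out, nil
    }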

@anuraagnalluri requested a review from yangwwei on March 13, 2022 at 18:39
@anuraagnalluri force-pushed the YUNIKORN-1040 branch 7 times, most recently from dca07b2 to 467daa7 on March 21, 2022 at 02:08
@anuraagnalluri (Contributor, Author)

@yangwwei Apologies for the relative inactivity on this PR; I'm just getting back to it now. I've addressed most of your concerns and am going with the singleton approach, since I could not find a way to share context between different ginkgo test suites (only within a suite via a Describe block). Since ginkgo can also parallelize within a suite, It blocks potentially run within their own "containers". Therefore, if port-forwarding is managed by the Go runtime, it should now be set up for all tests that issue REST calls to the scheduler service.

If we can verify that all suites execute in independent runtimes (which seems plausible from the log output of the checks), we don't even need to follow the singleton pattern, since each runtime would have its own port forwarder.

With the changes I currently have, some checks intermittently fail because the allocations cannot be retrieved after the scheduler restart. Many times all checks pass, but there is not a 100% pass rate when re-running them. I've printed some debug information which shows that the sleep pod is visible in the dev namespace in both cases, but the allocations are empty in the failing cases. You can search for "appsInfo allocations are" in the failing checks to verify this.

Any thoughts on what might cause the empty allocations when querying /ws/v1/apps after a scheduler restart? I haven't been able to reproduce the error locally.

@yangwwei (Contributor)

Hi @anuraagnalluri, thanks for the updates, appreciated!
It is sometimes not easy to figure out what causes intermittent issues; please allow me to look into it today and tomorrow, and I will get back to you. Thanks for the effort of getting this far!

@yangwwei (Contributor)

Hi @anuraagnalluri,

There are a couple of things:

  1. The unit test failure was unrelated to your patch; it looks like we need to use a read lock here when getting the cleanupTime. We need a separate JIRA to get this fixed. For this PR, as long as we can get a good UT run, we are good.
  2. The failed e2e tests run with the --plugin option, which means they are testing scheduler-plugin mode. The panic happens at /home/runner/work/incubator-yunikorn-k8shim/incubator-yunikorn-k8shim/test/e2e/recovery_and_restart/recovery_and_restart_test.go:120 +0x156c, because the allocations list retrieved from the scheduler was empty (we can also see that from the debug output). Have you checked the history to see whether it always fails under plugin mode? If that's the case, I suggest rerunning this simple scenario in plugin mode to see if it behaves differently after a scheduler restart.

@anuraagnalluri (Contributor, Author)

@yangwwei Thanks for your input. I came to a similar realization that the intermittent failures occur in plugin mode. I did locally spin up the scheduler in plugin mode and ran the recovery_and_restart test suite, but I was not able to reproduce the error we see in the failing checks here.

Specifically, the suite passes and the allocations list is non-empty. I'll keep looking into this issue.

@yangwwei (Contributor)

Hi @anuraagnalluri, could you please rebase your changes onto the latest master? YuniKorn is now an Apache TLP, so we have renamed our repos with some related code changes; hopefully those won't affect this PR.
I also spoke with @ronazhan earlier today; he has a lot of expertise with the e2e tests and will help take a look at this issue as well. Thanks!

@anuraagnalluri (Contributor, Author)

@yangwwei Done, and I changed the necessary imports. Thanks for getting another pair of eyes on this. I was able to reproduce the error locally a couple of times in plugin mode, but I'm still unsure why the allocations list is empty.

When I ran into the same failure we see in the CI checks, I was able to verify that the applicationID of the sleep-job pod belongs to the newly added recovery_and_restart suite and not basic_scheduling_test. My initial thought was that a "completed" sleep job with 0 allocations from a previous test could have been picked up, but this is not the case (that test also tears down its namespace in cleanup). I could see the sleep pod was in the "Running" state, and I ultimately could not identify any metadata differences between the failing case and passing test runs.

Is it possible that plugin-mode logic could specifically affect this behavior in a way normal mode cannot?

@ronazhan (Contributor)

Specifically, the suite passes and allocations list is non-empty.

appsInfo allocations are: []

@anuraagnalluri Just to clarify: is the remaining failure due to the test expecting a nil object representing an application's allocations, but instead receiving an allocation list of length 0?

This seems like a minor, acceptable behavior change, but @yangwwei, can you help confirm this is now expected? Ideally, this should be consistent between the plugin and regular YuniKorn versions.

For now, I would recommend adding an if-statement check to keep the test backwards compatible with previous versions: if the allocations object is not nil, validate that the allocations length is 0.
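
Something along these lines, assuming appsInfo is the decoded /ws/v1/apps response and "allocations" is its field name (both assumptions), with gomega's Ω dot-imported as in the existing suites:

    // Backwards-compatible check (sketch): older versions may return nil for
    // allocations, newer ones an empty list; accept both when no allocation is
    // expected.
    allocations, present := appsInfo["allocations"]
    if present && allocations != nil {
        Ω(allocations).To(BeEmpty())
    }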

@anuraagnalluri (Contributor, Author)

@anuraagnalluri Just to clarify, the remaining failure is due to the test expecting a nil object representing an application's allocations, but instead it receives an allocation list of length 0?

@ronazhan Appreciate your help! That is not quite the issue. The issue is that there is a sleep pod deployed in the development namespace here, so the test actually expects an allocation list of length 1 here.

Most of the time this works and we can see the corresponding allocation. Sometimes, but only in plugin mode, the retrieved allocation list has length 0 despite the sleep pod being present and running in the dev namespace. This causes an error when trying to index the first element of the list with [0], but it's unclear why the corresponding allocation isn't showing up in the appsInfo var, which is populated with a GetAppInfo call here.

It's worth noting that the only difference between this and basic_scheduling_test behavior is that we bounce the scheduler pod prior to making these checks. We want to ensure that restarting the scheduler won't drop allocations or affect any normal functionality, but this does not seem to hold 100% of the time, specifically in plugin mode.

@craigcondit (Contributor)

The issue is that there is a sleep pod deployed in development namespace here, so the test actually expects an allocation list of length 1 here.

Most of the time, this works and we can see the corresponding allocation. Sometimes, but only on plugin mode, this allocation list that's retrieved is of length 0 despite the sleep pod being present/running in the dev namespace. This causes an error when trying to index the first element of the list with [0], but it's unclear why the corresponding allocation isn't showing in the appsInfo var which is populated with a GetAppInfo call here.

It's worth noting that the only difference between this and basic_scheduling_test behavior is that we bounce the scheduler pod prior to making these checks. We want to ensure that restarting the scheduler won't drop allocations or affect any normal functionality. But this does not seem to be the case 100% of the time specifically in plugin mode.

There are a few issues I can see with the test right away. For one thing, you've structured this as multiple integration tests when in fact there is really only one. All of the restart / re-forward behavior should go into the setup method rather than being a separate It test (Ginkgo doesn't guarantee test ordering, so the current structure is brittle, even if it appears to execute correctly).

More importantly, there's nothing I can see here that waits for scheduler recovery to complete. That's not a simple thing to do, but you might try scheduling a second pod (after the restart) and waiting for it to be running. Once that happens, it should be safe to query for the original pod.
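
A hedged sketch of that ordering, meant to live in the suite setup; the sleep-pod helper, kClient methods, pod name and timeout below are assumptions modelled on the existing e2e helpers, not their exact signatures:

    // 1. Restart the scheduler pod and wait for the replacement (see the earlier sketch).

    // 2. Schedule a second "probe" pod after the restart; the scheduler can only
    //    place it once recovery has finished, so a Running probe pod signals that
    //    it is safe to query for the original pod again.
    probeConf := k8s.SleepPodConfig{Name: "sleep-after-restart", NS: devNamespace} // placeholder names
    probePod, err := k8s.InitSleepPod(probeConf)
    Ω(err).NotTo(HaveOccurred())
    _, err = kClient.CreatePod(probePod, devNamespace)
    Ω(err).NotTo(HaveOccurred())
    err = kClient.WaitForPodRunning(devNamespace, "sleep-after-restart", 60*time.Second)
    Ω(err).NotTo(HaveOccurred())

    // 3. Only now query /ws/v1/apps and assert that the original sleep pod's
    //    allocation is present.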

@craigcondit (Contributor) left a comment

A couple of minor issues with timing.

@anuraagnalluri (Contributor, Author)

@craigcondit Thanks a lot for your input. You're absolutely right about the multi-integration-test structure; I now realize Ginkgo runs its It blocks in independent "containers" with no guarantees on ordering. I also wrongly assumed that a "running" scheduler guarantees its state is updated, but it makes sense that a "running" second deployment would be a better verification for that.

I believe the checks are working now and am willing to change the timeouts if you think they should diverge from basic_scheduling_test. Thanks a lot for your review :)

@craigcondit (Contributor)

I think the timeout for the final pod status should probably be 30-60 seconds. We need to allow time for recovery, and that might take a little while. A normal pod with an already running scheduler will probably be scheduled reliably within 10 seconds, but one on a just-started, not-yet-recovered scheduler may not.

@craigcondit (Contributor) left a comment

+1 LGTM. I'll merge this in shortly.
