
[YUNIKORN-1040] add e2e test that re-starts the scheduler pod #369

Closed
wants to merge 1 commit

Conversation

@anuraagnalluri (Contributor) commented Feb 14, 2022

What is this PR for?

Add an e2e verification for restarting the scheduler pod as discussed in YUNIKORN-1040. Added as a flow in the ginkgo BeforeSuite in basicscheduling_test.go.

What type of PR is it?

  • Bug Fix
  • Improvement
  • Feature
  • Documentation
  • Hot Fix
  • Refactoring

Todos

What is the Jira issue?

How should this be tested?

Ran e2e test

Screenshots (if appropriate)

Questions:

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 4 times, most recently from 1d20074 to 44d7630 on February 14, 2022 at 23:29
codecov bot commented Feb 14, 2022

Codecov Report

Merging #369 (cbe9b93) into master (d618415) will not change coverage.
The diff coverage is n/a.

❗ Current head cbe9b93 differs from pull request most recent head 7f4c2a5. Consider uploading reports for the commit 7f4c2a5 to get more accurate results

@@           Coverage Diff           @@
##           master     #369   +/-   ##
=======================================
  Coverage   65.41%   65.41%           
=======================================
  Files          40       40           
  Lines        6314     6314           
=======================================
  Hits         4130     4130           
  Misses       2023     2023           
  Partials      161      161           


@anuraagnalluri (Contributor, Author)

Hi @yangwwei, do you have any general pointers on how I can debug the failing CI checks? Running the e2e tests locally on this branch passes, but fetching the applications seems to fail in the pre-commit checks.

@wilfred-s (Contributor) left a comment

The scale up and scale down are way too nice to the scheduler. We want to kill the pod that runs the scheduler and then see whether the pod gets rescheduled again. The admission controller should not be killed.
The fact that the scheduler runs as a deployment should cause the scheduler pod to be recreated and scheduled again.
We can follow these high-level steps:

  1. kClient.GetPodNamesFromNS("YK Namespace")
  2. Get the pod named "yunikorn-scheduler-*"
  3. kClient.DeletePod("YK scheduler", "YK Namespace")

Even after the delete call has returned without an error, we should see a new scheduler pod in the ready state within a short amount of time.
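
A minimal Go sketch of that flow, assuming kClient is the suite's Kubernetes helper and gomega's Ω is dot-imported as in the existing suites; the helper names, namespace, pod-name prefix and label selector below are assumptions rather than the framework's exact API:

    // Sketch only: helper names, the "yunikorn" namespace and the label
    // selector are placeholders; the real e2e helpers may differ.
    ykNamespace := "yunikorn"

    podNames, err := kClient.GetPodNamesFromNS(ykNamespace)
    Ω(err).NotTo(HaveOccurred())

    // Find the scheduler pod; the admission-controller pod is left alone.
    var schedulerPod string
    for _, name := range podNames {
        if strings.HasPrefix(name, "yunikorn-scheduler-") {
            schedulerPod = name
            break
        }
    }
    Ω(schedulerPod).NotTo(BeEmpty())

    // Kill the scheduler pod; the deployment should recreate it.
    err = kClient.DeletePod(schedulerPod, ykNamespace)
    Ω(err).NotTo(HaveOccurred())

    // Wait for the replacement pod to come back up in ready state.
    err = kClient.WaitForPodBySelectorRunning(ykNamespace, "app=yunikorn-scheduler", 60)
    Ω(err).NotTo(HaveOccurred())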

@anuraagnalluri (Contributor, Author)

Thanks @wilfred-s, I incorporated your suggestions and ran the e2e tests locally with success. I'm not sure why the pre-commit checks fail on getting the applications in CI. Is there any way I can get access to the cluster these run on to debug?

@yangwwei (Contributor)

Sorry for the late response; I was on vacation last week, so emails/notifications got backlogged. For this issue, @ronazhan, can you share your thoughts?

I am not entirely sure what the best way to debug this is. The error shown above:

Unexpected error:
      <*url.Error | 0xc0004c0210>: {
          Op: "Get",
          URL: "http://localhost:9080/ws/v1/apps",
          Err: {s: "EOF"},
      }
      Get "http://localhost:9080/ws/v1/apps": EOF
  occurred

That suggests the REST endpoint could not be accessed. Maybe this is because the scheduler pod gets restarted and we need to redo the port-forwarding? Here is how it was done during the initial setup: https://github.com/apache/incubator-yunikorn-k8shim/blob/a61fc0052c07e510503853db74030290bbda562b/scripts/run-e2e-tests.sh#L193-L199.

@wilfred-s (Contributor)

I don't know what happened with my last update; I must have closed a browser window without saving.
I think @yangwwei is right: the port-forward has not been recreated after the restart, and that will cause issues for the rest of the tests.

I would also suggest that we move this restart test into its own test suite. Instead of making it part of basic scheduling and causing those tests to fail, we should create a "recovery & restart" suite, move the restart test in there, and then extend it with e2e recovery tests.

Using some more advanced ginkgo tricks we can then make sure these restart or recovery tests run separately from all the others. Adding labels and splitting the ginkgo run into two runs would allow us to be more destructive in tests. I do think that requires ginkgo v2 with its test labels. @ronazhan, please give some input on the ginkgo v2 change as well.

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 3 times, most recently from aafa9e4 to 35dfd71 on February 26, 2022 at 03:53
@ronazhan (Contributor)

Yes, @wilfred-s and @yangwwei are right that the port-forward connection needs to be restarted as well after the new scheduler pod is brought up. I don't believe port-forwarding is available through the k8s go-client library, so this might have to be done as a shell command.

I haven't looked much into the ginkgo v2 changes, but the recovery suite can be separated by using the --skip-package flag. Then it can be run separately from the other test packages.

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 3 times, most recently from 90d42f3 to 2462ef9 on February 26, 2022 at 09:30
@anuraagnalluri (Contributor, Author) commented Feb 26, 2022

@yangwwei @wilfred-s @ronazhan Thank you for your input. I've applied most of your suggestions and the e2e tests are passing locally again. I've also verified that the port-forward process we add in the AfterSuite is running and functional after test execution.

However, the CI checks are still failing. Something is likely off with the port-forwarding logic I added in k8s_utils.

@anuraagnalluri force-pushed the YUNIKORN-1040 branch 3 times, most recently from 98884cf to 0fd6c5c on February 27, 2022 at 21:33
@yangwwei (Contributor)

Looks like the CI is failing after the recovery test suite; could it be that there is still something wrong with the port-forwarding logic? Can we collect both stdout and stderr in PortForwardService while executing exec.Command, e.g. using https://pkg.go.dev/os/exec#Cmd.CombinedOutput, to get more info?
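
For reference, a hedged sketch of one way to capture both streams. Since kubectl port-forward never exits on its own, CombinedOutput (which waits for the command to finish) would block forever, so the sketch merges stdout and stderr into one buffer that can be dumped whenever the forward dies. The package name, function name, service name and port mapping are placeholders, not the actual values from run-e2e-tests.sh or k8s_utils:

    package k8s

    import (
        "bytes"
        "fmt"
        "os/exec"
    )

    // startPortForward starts `kubectl port-forward` asynchronously and merges its
    // stdout and stderr into a single buffer, giving the same view as
    // CombinedOutput but for a process that is expected to keep running.
    func startPortForward(namespace string) (*exec.Cmd, *bytes.Buffer, error) {
        // NOTE: service name and port mapping below are placeholders.
        cmd := exec.Command("kubectl", "port-forward",
            "svc/yunikorn-service", "9080:9080", "-n", namespace)

        var out bytes.Buffer
        cmd.Stdout = &out
        cmd.Stderr = &out

        if err := cmd.Start(); err != nil {
            return nil, nil, fmt.Errorf("failed to start port-forward: %w", err)
        }
        return cmd, &out, nil
    }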

@anuraagnalluri requested a review from yangwwei on March 13, 2022 at 18:39
@anuraagnalluri force-pushed the YUNIKORN-1040 branch 7 times, most recently from dca07b2 to 467daa7 on March 21, 2022 at 02:08
@anuraagnalluri (Contributor, Author)

@yangwwei Apologies for the relative inactivity on this PR; I'm just getting back to it now. I've addressed most of your concerns and am going with the singleton approach, since I could not find a way to share context between different ginkgo test suites (only within a suite via a Describe block). Since ginkgo can also parallelize within a suite, It blocks potentially run within their own "containers". Therefore, if port-forwarding is managed by the Go runtime, it should now be set up for all tests that issue REST calls to the scheduler service.

If we can verify that all suites execute in independent runtimes (which seems plausible from the log output of the checks), we don't even need to follow the singleton pattern, since each runtime would have its own port forwarder.

With the changes I currently have, some checks intermittently fail because the allocations cannot be retrieved after the scheduler restart. Many times all checks pass, but there is not a 100% pass rate when re-running them. I've printed some debug information which shows that the sleep pod is visible in the dev namespace in both cases, but the allocations are empty in the failing cases. You can search for "appsInfo allocations are" in the failing checks to verify this.

Any thoughts on what might cause the empty allocations when querying /ws/v1/apps after a scheduler restart? I haven't been able to reproduce the error locally.

@yangwwei (Contributor)

Hi @anuraagnalluri, thanks for the updates, appreciated!
It is sometimes not easy to figure out what causes intermittent issues; please allow me to look into it today and tomorrow, and I will get back to you. Thanks for the effort of getting this far!

@yangwwei (Contributor)

Hi @anuraagnalluri,

There are a couple of things:

  1. The unit test failure was unrelated to your patch; it looks like we need to use a read lock here when getting the cleanupTime. We need a separate JIRA to get this fixed. For this PR, as long as we can get a good UT run, we are good.
  2. The failed e2e tests run with the --plugin option, which means they are testing scheduler-plugin mode. The panic happens at /home/runner/work/incubator-yunikorn-k8shim/incubator-yunikorn-k8shim/test/e2e/recovery_and_restart/recovery_and_restart_test.go:120 +0x156c, because the allocations list retrieved from the scheduler was empty (we can also see that from the debug output). Have you checked the history to see whether it always fails under plugin mode? If that's the case, I suggest rerunning this simple scenario in plugin mode to see if it behaves differently after a scheduler restart.

@anuraagnalluri (Contributor, Author)

@yangwwei Thanks for your input. I came to a similar realization that the intermittent failures occur in plugin mode. I did locally spin up the scheduler in plugin mode and ran the recovery_and_restart test suite, but I was not able to reproduce the error we see in the failing checks here.

Specifically, the suite passes and the allocations list is non-empty. I'll keep looking into this issue.

@yangwwei (Contributor)

Hi @anuraagnalluri, could you please rebase your changes onto the latest master? YuniKorn is now an Apache TLP, so we have renamed our repos with some related code changes; hopefully those won't affect this PR.
I also spoke with @ronazhan earlier today; he has a lot of expertise with the e2e tests and will help take a look at this issue as well. Thanks!

@anuraagnalluri (Contributor, Author)

@yangwwei Done, and I changed the necessary imports. Thanks for getting another pair of eyes on this. I was able to reproduce the error locally a couple of times in plugin mode, but I'm still unsure why the allocations list is empty.

When I ran into the same failure we see in the CI checks, I was able to verify that the applicationID of the sleep-job pod belongs to the newly added recovery_and_restart suite and not basic_scheduling_test. My initial thought was that a "completed" sleep job with 0 allocations from a previous test could have been picked up, but this is not the case (that test also tears down its namespace in cleanup). I could see the sleep pod was in the "Running" state, and I ultimately could not identify any metadata differences between the failing case and passing test runs.

Is it possible that plugin-mode logic could specifically affect this behavior in a way normal mode cannot?

@ronazhan (Contributor)

Specifically, the suite passes and allocations list is non-empty.

appsInfo allocations are: []

@anuraagnalluri Just to clarify: is the remaining failure due to the test expecting a nil object representing an application's allocations, but instead receiving an allocation list of length 0?

This seems like a minor, acceptable behavior change, but @yangwwei, can you help confirm this is now expected? Ideally, this should be consistent between the plugin and regular YuniKorn versions.

For now, I would recommend adding an if-statement check to keep the test backwards compatible with previous versions: if the allocations object is not nil, validate that the allocations length is 0.
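
Something along these lines, assuming appsInfo is the decoded /ws/v1/apps response and "allocations" is its field name (both assumptions), with gomega's Ω dot-imported as in the existing suites:

    // Backwards-compatible check (sketch): older versions may return nil for
    // allocations, newer ones an empty list; accept both when no allocation is
    // expected.
    allocations, present := appsInfo["allocations"]
    if present && allocations != nil {
        Ω(allocations).To(BeEmpty())
    }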

@anuraagnalluri (Contributor, Author)

@anuraagnalluri Just to clarify, the remaining failure is due to the test expecting a nil object representing an application's allocations, but instead it receives an allocation list of length 0?

@ronazhan Appreciate your help! That is not quite the issue. The issue is that there is a sleep pod deployed in the development namespace here, so the test actually expects an allocation list of length 1 here.

Most of the time this works and we can see the corresponding allocation. Sometimes, but only in plugin mode, the retrieved allocation list has length 0 despite the sleep pod being present and running in the dev namespace. This causes an error when trying to index the first element of the list with [0], but it's unclear why the corresponding allocation isn't showing up in the appsInfo var, which is populated with a GetAppInfo call here.

It's worth noting that the only difference between this and basic_scheduling_test behavior is that we bounce the scheduler pod prior to making these checks. We want to ensure that restarting the scheduler won't drop allocations or affect any normal functionality, but this does not seem to hold 100% of the time, specifically in plugin mode.

@craigcondit (Contributor)

The issue is that there is a sleep pod deployed in development namespace here, so the test actually expects an allocation list of length 1 here.

Most of the time, this works and we can see the corresponding allocation. Sometimes, but only on plugin mode, this allocation list that's retrieved is of length 0 despite the sleep pod being present/running in the dev namespace. This causes an error when trying to index the first element of the list with [0], but it's unclear why the corresponding allocation isn't showing in the appsInfo var which is populated with a GetAppInfo call here.

It's worth noting that the only difference between this and basic_scheduling_test behavior is that we bounce the scheduler pod prior to making these checks. We want to ensure that restarting the scheduler won't drop allocations or affect any normal functionality. But this does not seem to be the case 100% of the time specifically in plugin mode.

There are a few issues I can see with the test right away. For one thing, you've structured this as multiple integration tests when in fact there is really only one. All of the restart / re-forward behavior should go into the setup method rather than being a separate It test (Ginkgo doesn't guarantee test ordering, so the current structure is brittle, even if it appears to execute correctly).

More importantly, there's nothing I can see here that waits for scheduler recovery to complete. That's not a simple thing to do, but you might try scheduling a second pod (after the restart) and waiting for it to be running. Once that happens, it should be safe to query for the original pod.
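
A hedged sketch of that ordering, meant to live in the suite setup; the sleep-pod helper, kClient methods, pod name and timeout below are assumptions modelled on the existing e2e helpers, not their exact signatures:

    // 1. Restart the scheduler pod and wait for the replacement (see the earlier sketch).

    // 2. Schedule a second "probe" pod after the restart; the scheduler can only
    //    place it once recovery has finished, so a Running probe pod signals that
    //    it is safe to query for the original pod again.
    probeConf := k8s.SleepPodConfig{Name: "sleep-after-restart", NS: devNamespace} // placeholder names
    probePod, err := k8s.InitSleepPod(probeConf)
    Ω(err).NotTo(HaveOccurred())
    _, err = kClient.CreatePod(probePod, devNamespace)
    Ω(err).NotTo(HaveOccurred())
    err = kClient.WaitForPodRunning(devNamespace, "sleep-after-restart", 60*time.Second)
    Ω(err).NotTo(HaveOccurred())

    // 3. Only now query /ws/v1/apps and assert that the original sleep pod's
    //    allocation is present.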

@craigcondit (Contributor) left a comment

A couple of minor issues with timing.

@anuraagnalluri (Contributor, Author)

@craigcondit Thanks a lot for your input. You're absolutely right about the multi-integration-test structure; I now realize Ginkgo runs its It blocks in independent "containers" with no guarantees on ordering. I also wrongly assumed that a "running" scheduler guarantees its state is updated, but it makes sense that a "running" second deployment would be a better verification for that.

I believe the checks are working now and am willing to change the timeouts if you think they should diverge from basic_scheduling_test. Thanks a lot for your review :)

@craigcondit (Contributor)

I think the timeout for the final pod status should probably be 30-60 seconds. We need to allow time for recovery, and that might take a little while. A normal pod with an already running scheduler will probably be scheduled reliably within 10 seconds, but one on a just-started, not-yet-recovered scheduler may not.

@craigcondit (Contributor) left a comment

+1 LGTM. I'll merge this in shortly.
