Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review and Refactor Goroutine Context Handling #2465

Closed
zhanggbj opened this issue Oct 26, 2023 · 9 comments · Fixed by #2551
Closed

Review and Refactor Goroutine Context Handling #2465

zhanggbj opened this issue Oct 26, 2023 · 9 comments · Fixed by #2551
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@zhanggbj
Copy link
Contributor

/kind bug

What steps did you take and what happened:
[A clear and concise description of what the bug is.]
This is from a discussion here.
To be more specific:

  • Review and refactor the handling of goroutine contexts to ensure they continue independent of the current Reconcile.
  • Currently, passing through the context and stopping goroutines if that context is Done may lead to bugs, such as not reconciling again after a task is finished.
  • The issue aims to identify instances where goroutines are prematurely stopped based on passed-through contexts.
  • Additionally, consider discussing the possibility of adding timeouts to contexts used in these goroutines to prevent goroutine leaks or explore alternative solutions.

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api-provider-vsphere version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 26, 2023
@chrischdi
Copy link
Member

chrischdi commented Oct 26, 2023

There are some cases to check:

  1. controllers/vspherecluster_reconciler.go
  1. pkg/manager/manager.go
  1. pkg/services/govmomi/util.go

@chrischdi
Copy link
Member

1.a.controllers/vspherecluster_reconciler.go#L458

  • This goroutine may be leaked if apiserver never gets online.
  • It is there to inject an event to reconcile the vspherecluster again as soon as the apiserver is online.
  • We could use a context with timeout instead to ensure it gets canceled after some time.
  • We would need to check if a requeued reconciliation works up things

Conclusion: this go routines could get improved (what to be improved: to be defined

2.a. pkg/manager/manager.go#L135 and pkg/manager/manager.go#L147

  • These two go routines are for watching and updating credentials of the manager based on the filesystem.
  • Both are only created once on startup via main.go so a constant of 2 go routines and it should not leak any.

Conclusion: this go routines are OK

@chrischdi
Copy link
Member

The go routines are 3 are more complex to analyse and also kind of depend on each other sometimes.

3.a. util.go#L219

  • This go routine works on a channel using a for loop unti it is closed.
  • The channel gets closed after usage.
  • But it depends on the go routine in 3.e. to succeed.

Conclusion: this go routine is OK as is

@chrischdi
Copy link
Member

I will need some more time to take a look at 3.b. to 3.e. , if anyone wants to go ahead, feel free.

@zhanggbj
Copy link
Contributor Author

/assign
I would like to start with 1.a

@chrischdi
Copy link
Member

3.b. util.go#L280

  • This go routine waits for a function (which is passed as argument) to complete
  • and sends an event to a channel

There is only one case where this is used, so there is only one type of function which this go routine waits for completion:

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/9cec76143d92355bac9f2df261029ddf880bef4c/pkg/services/govmomi/util.go#L250-272

For the case of the analysis we can strip this down to taskHelper.WaitForResult which under the hood calls govmomis task.Wait function to wait for a given task to either succeed or fail.

  • It is there to inject an event to reconcile the object again as soon as a task on vCenter did finish.
  • We could use a context with timeout to ensure it gets canceled after some time. At latest it should get picked up via the resync period.

Conclusion: this go routines could maybe get improved (what to be improved: to be defined)

@chrischdi
Copy link
Member

chrischdi commented Nov 21, 2023

For the cases 3.b, 3.d and 3.e I created a picture how they get invoked (they all are somewhat related to each other):

image

The high level description of the reconcileVSphereVMWhenNetworkIsReady with regards to the go routine parts comes down to:

  • Start a goroutine via reconcileVSphereVMOnChannel (3.c) and return

3.c. util.go#L304

High-level: This go routine is then responsible for emitting events to trigger reconciliation when a VM gets new IP addresses set in vCenter, which ends when all expected IP addresses got set.

The lower level description is

  • run the waitFn function which
    • waits for the VM to be in powered on state (in vCenter on the VM object)
    • wait for all nics to have a MAC address set (in vCenter on the VM object)
    • start a go routine (3.e) which waits for changes regarding IP addresses and sends messages to a channel (channel A)
    • return two channels (for reference call it channel A and channel error)
  • use for select to either
    • read a message from channel A:
      • start a go routine (3.d) to write a message to the GVK channel to trigger reconciliation
    • read a message from channel error:
      • log & return / exit the go routine
    • wait for the VM to have all expected IP addresses set (in vCenter on the VM object)

Conclusion: no conclusion yet, very complex, but looks like we don't need all the indirections / passing down functions and could make more readable what happens. Also thinking about if we could eliminate 3.d. and directly write to the other channel

3.d. util.go#L316

Writes a single message to the GVK channel to trigger a reconciliation

Conclusion: The go routine should be done immediately after writing to the GVK channel

3.e. util.go#L568

  • Waits for changes regarding IP addresses
  • Sends a message for every changed IP address to a channel, which then gets read inside go routine from 3.c and triggers 3.d
  • ends if all expected ip addresses got set

Conclusion: A potentially leak of a go routine if a VM never gets all IP addresses set

@sbueringer
Copy link
Member

sbueringer commented Dec 13, 2023

My take:
(I'll edit this comment while I go through the cases)

  1. Let's drop the entire reconcileVSphereClusterWhenAPIServerIsOnline thing. As far as I can tell nowadays there is nothing in the VSpherCluster controller that depends on the kube-apiserver
  2. Agree, nothing to do
  3. Probably would keep it as is, but use context.Background for 3.e

@sbueringer
Copy link
Member

Thx @chrischdi for all the research. Really nice work!

Discussed with Christian what we want to do. We would:

  1. Drop the code because it's not used anymore
  2. e) use context.Background()

In all other cases we would keep it as is. For 3 it's mostly because there are currently no known issues and we don't want to take the risk of breaking this code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
4 participants