Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve traffic-managers podCIDR discovery #2093

Merged
merged 11 commits into from
Sep 21, 2021

Conversation

thallgren
Copy link
Member

Description

This PR contains two improvements to the pod CIDR discovery mechanism.

  1. Performance improved by replacing the kates watchers with standard k8s APIs ([v2] traffic-manager high CPU usage after some undetermined amount of usage #2084).
  2. Make pod discovery strategy configurable in the Helm chart (Subnet retrieval should be more flexible #2023).

Checklist

  • I made sure to update ./CHANGELOG.md.
  • My change is adequately tested.

@thallgren thallgren requested a review from josecv September 17, 2021 15:03
@thallgren thallgren self-assigned this Sep 17, 2021
@thallgren thallgren added bug Something isn't working feature New feature or enhancement request labels Sep 17, 2021
@thallgren thallgren added this to the 2021 Cycle 6 milestone Sep 17, 2021
Copy link
Member

@josecv josecv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good but i'm somewhat confused about the watch function --just wanna make sure that it actually is being invoked, otherwise none of this will work 🙃

# nodePodCIDRs extract CIDRs from the podCIDR and podCIDRs field of the Node Spec.
# coverPodIPs extract IPs from the podIP and podIPs field of the Pod Status and compute the CIDRs needed to cover those IPs.
# environment use CIDRs listed in the space separated POD_CIDRS environment variable verbatim.
# auto first tru nodePodCIDRs and if that fails, tru coverPodIPs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo -- it says tru instead of try

informer cache.SharedIndexInformer
subnets []*net.IPNet
changed time.Time
lock sync.Mutex // Protects all access to ipsMap
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this read Protects all access to subnets?

return changed
}

func (w *nodeWatcher) watch(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i'm just confused but it seems like watch is never invoked?

Comment on lines 64 to 108
func (w *nodeWatcher) viable(ctx context.Context) bool {
nodes, err := w.lister.List(labels.Everything())
if err != nil {
dlog.Errorf(ctx, "unable to list nodes: %v", err)
return false
}

// Create the initial snapshot
w.lock.Lock()
defer w.lock.Unlock()
changed := false
dlog.Infof(ctx, "Scanning %d nodes", len(nodes))
for _, node := range nodes {
if w.addLocked(nodeSubnets(ctx, node)) {
changed = true
}
}
if changed {
dlog.Infof(ctx, "Found %d subnets", len(w.subnets))
w.changed = time.Now()
} else {
dlog.Info(ctx, "No subnets found")
}
return changed
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like this function combines initialization with a precondition check (i.e. "Are we actually able to watch subnets from nodes?") -- that's okay, but calling it twice in a row will make the second call return false, no? Maybe we should short-circuit return if w.subnets is already populated -- after all, if they're there then getting them is definitely viable

@@ -0,0 +1,247 @@
package cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments apply here as for the nodewatcher :)

@thallgren thallgren force-pushed the thallgren/replace-agent-kates branch 2 times, most recently from b7b0415 to 497ec40 Compare September 18, 2021 14:59
The kates package provides nice and simple to use generalisations, but
it also introduces overhead and removes the ability to control some
aspects of watching. This commit replaces the traffic-manager's use of
the kates package with more standard k8s APIs.

Signed-off-by: Thomas Hallgren <[email protected]>
Fixes the
```
goroutine "/daemon/watch-cluster-info" exited with error: PANIC: close
of closed channel
```
that happens the second time the traffic-manager reports status info.

Signed-off-by: Thomas Hallgren <[email protected]>
Adds two environment variables to the traffic manager that controls
how it discovers what subnets the cluster is using for the pods. The
environment variables are configurable in the Helm chart.

Signed-off-by: Thomas Hallgren <[email protected]>
@thallgren thallgren force-pushed the thallgren/replace-agent-kates branch from 497ec40 to 3c13df6 Compare September 18, 2021 15:05
@thallgren
Copy link
Member Author

The watch function was never called, but that only meant that future updates to podCIDRs after the initial scan where undetected. The initial scan happened anyway (in viable). Not good of course, but not a case of "none of this will work" either.

Creates a new `subnet.Set` datatype with methods for adding, deleting,
and appending `net.IPNet` instances.

Signed-off-by: Thomas Hallgren <[email protected]>
Signed-off-by: Thomas Hallgren <[email protected]>
Signed-off-by: Thomas Hallgren <[email protected]>
@thallgren thallgren force-pushed the thallgren/replace-agent-kates branch from 3c13df6 to 07706f3 Compare September 19, 2021 06:45
@@ -79,7 +79,7 @@ podCIDRs: []
# nodePodCIDRs extract CIDRs from the podCIDR and podCIDRs field of the Node Spec.
# coverPodIPs extract IPs from the podIP and podIPs field of the Pod Status and compute the CIDRs needed to cover those IPs.
# environment use CIDRs listed in the space separated POD_CIDRS environment variable verbatim.
# auto first tru nodePodCIDRs and if that fails, tru coverPodIPs
# auto first try nodePodCIDRs and if that fails, tru coverPodIPs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still says tru coverPodIPs :^)

@thallgren thallgren force-pushed the thallgren/replace-agent-kates branch 2 times, most recently from 0045fb6 to 07706f3 Compare September 21, 2021 10:02
The 2 minutes timeout would sometimes fail.

Signed-off-by: Thomas Hallgren <[email protected]>
If the read at line 901 returned an error other than `io.EOF`, the
surrounding loop would continue in infinity. This commit ensures that
the loop brakes on any error.

Signed-off-by: Thomas Hallgren <[email protected]>
@donnyyung donnyyung mentioned this pull request Sep 21, 2021
4 tasks
@thallgren thallgren force-pushed the thallgren/replace-agent-kates branch from f2326be to b592579 Compare September 21, 2021 13:42
@thallgren thallgren force-pushed the thallgren/replace-agent-kates branch from b592579 to 3fd1746 Compare September 21, 2021 14:23
@thallgren thallgren merged commit d4f8deb into release/v2 Sep 21, 2021
@thallgren thallgren deleted the thallgren/replace-agent-kates branch September 21, 2021 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature New feature or enhancement request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants