topology-updater: reactive updates #1031
Conversation
Hi @Tal-or. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
✅ Deploy Preview for kubernetes-sigs-nfd ready!
Thanks @Tal-or for the PR. I had a quick run-through and some comments.
Also, some of the commit message subject lines are a bit misleading or ambiguous; adjust them, e.g.:
main: add kubelet-dir-path flag
-> topology-updater: add kubelet-dir-path flag
log: log event type
-> topology-updater: log event type that triggered update
manifests: add mount for kubelet dir
-> deployment/topology-updater: add mount for kubelet dir
(resolved review thread on deployment/components/topology-updater/topologyupdater-mounts.yaml)
Thanks for the quick review. This PR needs more polish, but it's good enough to get the conversation going.
I know the rationale because I worked on the prototype implementation. I think we should elaborate more in the PR description on what the benefit is for NFD-topology-updater users. It's OK to start minimal because this PR is a conversation starter (#1031 (comment)). Let's just make sure we have a more complete rationale by the time the PR is polished.
Yes, I agree. We should emphasize the benefits of the reactive updates and their advantages when scalability is a concern.
for {
	select {
	case <-crTrigger.C:
		klog.Infof("Scanning")
	case info := <-w.eventSource:
Wouldn't it be simpler and clearer, instead of pushing events from the Notifier into a new channel, to just change this select to read from `<-timerEvent` and `<-n.fsEvent` directly?
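For illustration, a minimal sketch of what that flattened loop could look like, using a plain `time.Ticker` and an fsnotify watcher read directly in the select; the variable names and the watched path here are placeholders, not the PR's actual fields:

```go
package main

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

func main() {
	// Time-based trigger, keeping the old interval behavior.
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	// FS-based trigger, read directly instead of via an intermediate channel.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatalf("failed to create fs watcher: %v", err)
	}
	defer watcher.Close()

	// Placeholder path; the kubelet state dir is configurable in the PR.
	if err := watcher.Add("/var/lib/kubelet"); err != nil {
		log.Fatalf("failed to watch kubelet dir: %v", err)
	}

	for {
		select {
		case <-ticker.C:
			log.Println("timer tick: scan, aggregate and update NRT objects")
		case e := <-watcher.Events:
			log.Printf("fs event %s on %s: scan, aggregate and update NRT objects", e.Op, e.Name)
		case err := <-watcher.Errors:
			log.Printf("fs watcher error: %v", err)
		}
	}
}
```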
We should create and configure the fs watcher, then filter the events we receive from it and pass only the relevant ones to the topology-updater select loop.
There is quite some work to do, hence the reason for moving this logic into a separate package.
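As a rough idea of what such a separate package could contain, here is a sketch of a notifier that filters fsnotify events and forwards only the relevant ones; the package, type, and function names are illustrative, not the PR's actual API:

```go
// Package notifier sketches an event source that watches the kubelet state
// directory and forwards only events for the checkpoint files of interest.
package notifier

import (
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

// Event describes a relevant change in the kubelet state directory.
type Event struct {
	File string
}

// Watch starts an fsnotify watcher on dir and returns a channel carrying
// only write/create events for the given state files (e.g. cpu_manager_state).
func Watch(dir string, stateFiles ...string) (<-chan Event, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, err
	}
	if err := watcher.Add(dir); err != nil {
		watcher.Close()
		return nil, err
	}

	relevant := make(map[string]bool, len(stateFiles))
	for _, f := range stateFiles {
		relevant[f] = true
	}

	out := make(chan Event)
	go func() {
		defer watcher.Close()
		defer close(out)
		for e := range watcher.Events {
			// Filter: forward only changes to the files we care about.
			if relevant[filepath.Base(e.Name)] && e.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				out <- Event{File: e.Name}
			}
		}
	}()
	return out, nil
}
```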
I was thinking something like this:
+ var crTrigger *time.Ticker
+ if w.resourcemonitorArgs.SleepInterval > 0 {
+ crTrigger = time.NewTicker(w.resourcemonitorArgs.SleepInterval)
+ }
+ ch, err := createFSWatcherEvent([]string{kubeletDirPath})
+ if err != nil {
+ return fmt.Errorf("failed to obtain node resource information: %w", err)
+ }
for {
select {
case <-crTrigger.C:
- klog.Infof("Scanning")
- podResources, err := resScan.Scan()
- utils.KlogDump(1, "podResources are", " ", podResources)
- if err != nil {
- klog.Warningf("Scan failed: %v", err)
- continue
- }
- zones = resAggr.Aggregate(podResources)
- utils.KlogDump(1, "After aggregating resources identified zones are", " ", zones)
- if !w.args.NoPublish {
- if err = w.updateNodeResourceTopology(zones); err != nil {
- return err
- }
- }
-
+ w.runUpdater(resScan, resAggr)
if w.args.Oneshot {
return nil
}
+ case e := <-n.fsEvent:
+ klog.V(5).Infof("fsnotify event from file %q: %q received", e.Name, e.Op)
+ if e.Name == CPUStateFile ||
+ e.Name == MemoryStateFile ||
+ e.Name == DevicesStateFile {
+ w.runUpdater(resScan, resAggr)
+ }
case <-w.stop:
klog.Infof("shutting down nfd-topology-updater")
return nil
@@ -164,6 +164,28 @@ func (w *nfdTopologyUpdater) Run() error {
}
+func (w *nfdTopologyUpdater) runUpdater(resScan resourcemonitor.ResourcesScanner, resAggr resourcemonitor.ResourcesAggregator) {
+ zones := w.aggregateZones(resScan, resAggr)
+ utils.KlogDump(1, "After aggregating resources identified zones are", " ", zones)
+ if !w.args.NoPublish {
+ if err := w.updateNodeResourceTopology(zones); err != nil {
+ klog.Warningf("cannot update NodeResourceTopology: %s", err.Error())
+ }
+ }
+}
+
+func (w *nfdTopologyUpdater) aggregateZones(resScan resourcemonitor.ResourcesScanner, resAggr resourcemonitor.ResourcesAggregator) v1alpha1.ZoneList {
+ klog.Infof("Scanning")
+ podResources, err := resScan.Scan()
+ utils.KlogDump(1, "podResources are", " ", podResources)
+ if err != nil {
+ klog.Warningf("Scan failed: %v", err)
+ return nil
+ }
+
+ return resAggr.Aggregate(podResources)
+}
+
We could just move the common logic into new funcs and remove the necessity of having a separate struct.
I tend to agree that inlining everything might be clearer, but IMVHO decoupling the event source generator from the main loop has greater benefits:
For example, you can generate your own source of events if needed, without changing the main loop logic.
This is especially useful for testing, in case you want to mock notification events.
@PiotrProkop WDYT?
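To illustrate the testing benefit: since the main loop only reads from a channel, a test can push synthetic events without any real fsnotify watcher. A minimal sketch, using a local stand-in for the notifier's event type:

```go
package updater_test

import "testing"

// event is a local stand-in for the notifier's message type.
type event struct{ file string }

func TestReactiveUpdate(t *testing.T) {
	events := make(chan event, 1)

	updates := 0
	done := make(chan struct{})
	go func() {
		defer close(done)
		// Stand-in for the updater's select loop: every event received
		// triggers one scan/aggregate/publish cycle.
		for range events {
			updates++
		}
	}()

	// Pretend the kubelet just rewrote its CPU manager state file.
	events <- event{file: "cpu_manager_state"}
	close(events)
	<-done

	if updates != 1 {
		t.Fatalf("expected 1 update, got %d", updates)
	}
}
```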
This makes sense. I have no strong objections against putting the event sources behind this abstraction, just wanted to indicate that this could be simplified. Both approaches work for me 😄
/ok-to-test
Sorry for the long absence, I'll address the comments in the following days
Force-pushed from 996bfe3 to a5e9b8a
@marquiz @PiotrProkop I would like to share my thoughts with you about the continuation of this work.
@Tal-or Not everyone is using the reserve plugin (for us it's the DiscardReservedNodes plugin) and we are using our custom patches in NFD for reactive updates, so we are very interested in this work. If you don't have the bandwidth to take care of this PR I would be happy to take over this work and finish it 😄
@PiotrProkop I'm very happy to hear that you're interested in this work; I'll continue it in the coming days.
@Tal-or what do you think about the timeline for this? I was thinking to have a bit more predictable/more frequent release cadence and cut v0.13.0 around the end of March, releasing whatever we've managed to get in by that time 😊 It would be great to get this in and let people drop some of their custom patches
Hey @marquiz, I have some time today, let me rebase and address the open comments, and let's see how the next review cycle goes
Not super urgent, just wanted to give you a heads-up
Force-pushed from a5e9b8a to 94f84b4
Force-pushed from b75b46f to 7af3de6
/cc
Enabling reactive updates for nfd-topology-updater by detecting changes in the Kubelet state/checkpoint files, and signaling the main loop to update the NodeResourceTopology objects.
This has high value when scaling is an issue: having multiple pods scheduled between two update instances may leave incorrect resource accounting in the NRT CRs.
Example (time interval = 5s):
t0 - new update sent to the NRT CRs
t1 - guaranteed podA is scheduled
t2 - guaranteed podB is scheduled
The time elapsed between t0 and t2 is < 5 seconds, IOW the update at t0 is still the most recent one. At t2 the resource accounting reflected by NRT is not aligned with the actual accounting, because the NRT CRs don't reflect the change that happened at t1.
With this reactive update feature we expect an update to be triggered between t1 and t2, so the NRT objects reflect a more accurate picture. There may still be scenarios where the updates aren't fast enough, but handling those is an additional, planned future optimization.
The notifier has two event types:
1. Time based - keeps the old behavior, triggering an update per interval.
2. FS event - triggers an update when the Kubelet state/checkpoint files are modified.
Signed-off-by: Talor Itzhak <[email protected]>
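For reference, based on the `Info` and `FSUpdate` names visible in the review diff above, the notifier's message types might look roughly like this (a sketch; the exact definitions in the code may differ):

```go
package notifier

// EventType says what triggered an update.
type EventType int

const (
	// IntervalBased keeps the old behavior: trigger an update per sleep interval.
	// (The constant name here is an assumption.)
	IntervalBased EventType = iota
	// FSUpdate triggers an update when a kubelet state/checkpoint file changes.
	FSUpdate
)

// Info is the message the notifier sends to the updater's main loop.
type Info struct {
	Event EventType
}
```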
On different Kubernetes flavors, like OpenShift for example, the Kubelet state directory path is different. Make it configurable for maximum flexibility. Signed-off-by: Talor Itzhak <[email protected]>
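A minimal sketch of how such a flag could be wired up with the standard `flag` package; the flag name and default path here are assumptions, not necessarily the PR's exact flag:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag name and default are illustrative; the PR adds a kubelet-dir-path
	// style flag so non-default layouts (e.g. OpenShift) can be supported.
	kubeletStateDir := flag.String("kubelet-state-dir", "/var/lib/kubelet",
		"path to the kubelet state/checkpoint directory to watch")
	flag.Parse()
	fmt.Println("watching kubelet state dir:", *kubeletStateDir)
}
```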
When a message is received via the channel, the main loop updates the `NodeResourceTopology` objects. The notifier will send a message via the channel if: 1. It reached the sleep timeout. 2. It detected a change in the Kubelet state files. Signed-off-by: Talor Itzhak <[email protected]>
Specify the event type as part of the log message. In order to reduce the log volume, make it V4. Signed-off-by: Talor Itzhak <[email protected]>
This mount is needed for watching the state files. Signed-off-by: Talor Itzhak <[email protected]>
Especially convenient for testing purposes, and completely harmless. Signed-off-by: Talor Itzhak <[email protected]>
Signed-off-by: Talor Itzhak <[email protected]>
Force-pushed from 65882ef to e20cc06
Force-pushed from f261c80 to 2d7890b
Adding the kubelet state directory mount. Signed-off-by: Talor Itzhak <[email protected]>
Signed-off-by: Talor Itzhak <[email protected]>
Access to the kubelet state directory may raise concerns in some setups, so add an option to disable it. The feature is enabled by default. Signed-off-by: Talor Itzhak <[email protected]>
Force-pushed from 2d7890b to 5c6be58
Thanks @Tal-or for the persistence on this. I think I don't have any more feedback and would be ready to get this in.
ping @ffromani @PiotrProkop
Thanks for the ping @marquiz, I'll review shortly
LGTM, but I'll leave final approval to @ffromani. Good work @Tal-or!
/lgtm
very nice work indeed. Kudos to @Tal-or
LGTM label has been added. Git tree hash: 65e359b51ea6ae20c384f683aae3e0046e570f40
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani, marquiz, Tal-or. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.