
topology-updater: reactive updates #1031

Merged

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented Jan 12, 2023

Enable reactive updates for nfd-topology-updater
by detecting changes in the Kubelet state/checkpoint files,
and signaling the main loop to update the NodeResourceTopology
objects.

This is especially valuable when scaling is a concern:
multiple pods deployed between two consecutive updates
can leave the NRT CRs with stale resource accounting.
Example:
Time Interval = 5s
t0 - New update sent to NRT CRs
t1 - Schedule guaranteed podA
t2 - Schedule guaranteed podB
The time elapsed between t0 and t2 is less than 5 seconds,
IOW the update at t0 is still the most recent one.

At t2 the resource accounting reflected by NRT
is not aligned with the actual accounting, because
the NRT CRs don't reflect the change that happened at t1.

With this reactive update feature we expect an update to be triggered
between t1 and t2, so the NRT objects reflect a more accurate
picture.

There might still be scenarios where the updates
aren't fast enough, but handling those is a
planned future optimization.

The notifier has two event types (a sketch follows below):

  1. Time-based - keeps the old behavior: trigger
    an update once per interval.
  2. FS event - trigger an update when the Kubelet state/checkpoint files are modified.
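
To make the mechanism concrete, here is a minimal sketch of the two-source notifier idea in Go — not the code merged in this PR. The `Event`, `EventType`, and `Run` names are illustrative; only the fsnotify usage (`fsnotify.NewWatcher`, `watcher.Add`, `watcher.Events`) is real library API.

```go
package notifier

import (
	"time"

	"github.com/fsnotify/fsnotify"
)

type EventType int

const (
	IntervalBased EventType = iota // old behavior: one update per interval
	FSUpdate                       // reactive: kubelet state file changed
)

type Event struct {
	Type EventType
}

// Run merges time-based and filesystem-based triggers into dest, so the
// updater's main loop only has to consume a single event channel.
func Run(interval time.Duration, kubeletDir string, dest chan<- Event) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	if err := watcher.Add(kubeletDir); err != nil {
		return err
	}
	ticker := time.NewTicker(interval)
	go func() {
		for {
			select {
			case <-ticker.C:
				dest <- Event{Type: IntervalBased}
			case e := <-watcher.Events:
				// Only writes to the state/checkpoint files matter.
				if e.Op&fsnotify.Write != 0 {
					dest <- Event{Type: FSUpdate}
				}
			}
		}
	}()
	return nil
}
```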

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 12, 2023
@k8s-ci-robot
Contributor

Hi @Tal-or. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 12, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 12, 2023
@netlify

netlify bot commented Jan 12, 2023

Deploy Preview for kubernetes-sigs-nfd ready!

Name Link
🔨 Latest commit 5c6be58
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/6412e7165261190008d34209
😎 Deploy Preview https://deploy-preview-1031--kubernetes-sigs-nfd.netlify.app

Contributor

@marquiz marquiz left a comment


Thanks @Tal-or for the PR. I had a quick run-through and some comments.

Also, some of the commit message subject lines are a bit misleading or ambiguous; please adjust, e.g.:

  • main: add kubelet-dir-path flag -> topology-updater: add kubelet-dir-path flag
  • log: log event type -> topology-updater: log event type that triggered update
  • manifests: add mount for kubelet dir -> deployment/topology-updater: add mount for kubelet dir

@Tal-or
Contributor Author

Tal-or commented Jan 12, 2023

Thanks for the quick review. This PR needs more polish, but it's good enough to get the conversation going.

@ffromani
Contributor

Enable reactive updates for nfd-topology-updater by detecting changes in the Kubelet state/checkpoint files, and signaling the main loop to update the NodeResourceTopology objects.

The notifier has two event types:

1. Time-based - keeps the old behavior: trigger
   an update once per interval.

2. FS event - trigger an update when the Kubelet state/checkpoint files are modified.

I know the rationale because I worked on the prototype implementation. I think we should elaborate more in the PR description on what the benefit is for nfd-topology-updater users. It's ok to start minimal because this PR is a conversation starter (#1031 (comment)). Let's just make sure we have a more complete rationale by the time the PR is polished.

@Tal-or
Copy link
Contributor Author

Tal-or commented Jan 12, 2023

Enable reactive updates for nfd-topology-updater by detecting changes in the Kubelet state/checkpoint files, and signaling the main loop to update the NodeResourceTopology objects.
The notifier has two event types:

1. Time-based - keeps the old behavior: trigger
   an update once per interval.

2. FS event - trigger an update when the Kubelet state/checkpoint files are modified.

I know the rationale because I worked on the prototype implementation. I think we should elaborate more in the PR description on what the benefit is for nfd-topology-updater users. It's ok to start minimal because this PR is a conversation starter (#1031 (comment)). Let's just make sure we have a more complete rationale by the time the PR is polished.

Yes, I agree. We should emphasize the benefits of reactive updates and their advantages when scalability is a concern.

for {
select {
case <-crTrigger.C:
klog.Infof("Scanning")
case info := <-w.eventSource:
Contributor


Wouldn't it be simpler and clearer, instead of pushing events from the Notifier into a new channel, to change this select to read from <-timerEvent and <-n.fsEvent directly?

Contributor Author

@Tal-or Tal-or Jan 23, 2023


We need to create and configure the fs watcher, then filter the events received from it, passing only the relevant ones to the topology-updater select loop.

There is quite some work there, hence the reason for moving this logic into a separate package.
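
For illustration, that filtering step could look roughly like the sketch below. This is not the code from this PR: `isRelevant` is a hypothetical helper, and the file names are the usual Kubelet state/checkpoint files (cpu/memory manager state, device manager checkpoint), which should be treated as an assumption here.

```go
package notifier

import (
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

// isRelevant reports whether an fsnotify event concerns one of the Kubelet
// state/checkpoint files the updater cares about, so unrelated events in
// the watched directory never reach the main select loop.
func isRelevant(e fsnotify.Event) bool {
	switch filepath.Base(e.Name) {
	case "cpu_manager_state", "memory_manager_state", "kubelet_internal_checkpoint":
		return e.Op&(fsnotify.Write|fsnotify.Create) != 0
	}
	return false
}
```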

Contributor


I was thinking something like this:

+       var crTrigger *time.Ticker
+       if w.resourcemonitorArgs.SleepInterval > 0 {
+               crTrigger = time.NewTicker(w.resourcemonitorArgs.SleepInterval)
+       }
+       ch, err := createFSWatcherEvent([]string{kubeletDirPath})
+       if err != nil {
+               return fmt.Errorf("failed to obtain node resource information: %w", err)
+       }
        for {
                select {
                case <-crTrigger.C:
-                       klog.Infof("Scanning")
-                       podResources, err := resScan.Scan()
-                       utils.KlogDump(1, "podResources are", "  ", podResources)
-                       if err != nil {
-                               klog.Warningf("Scan failed: %v", err)
-                               continue
-                       }
-                       zones = resAggr.Aggregate(podResources)
-                       utils.KlogDump(1, "After aggregating resources identified zones are", "  ", zones)
-                       if !w.args.NoPublish {
-                               if err = w.updateNodeResourceTopology(zones); err != nil {
-                                       return err
-                               }
-                       }
-
+                       w.runUpdater(resScan, resAggr)
                        if w.args.Oneshot {
                                return nil
                        }

+               case e := <-n.fsEvent:
+                       klog.V(5).Infof("fsnotify event from file %q: %q received", e.Name, e.Op)
+                       if e.Name == CPUStateFile ||
+                               e.Name == MemoryStateFile ||
+                               e.Name == DevicesStateFile {
+                               w.runUpdater(resScan, resAggr)
+                       }
                case <-w.stop:
                        klog.Infof("shutting down nfd-topology-updater")
                        return nil
@@ -164,6 +164,28 @@ func (w *nfdTopologyUpdater) Run() error {

 }

+func (w *nfdTopologyUpdater) runUpdater(resScan resourcemonitor.ResourcesScanner, resAggr resourcemonitor.ResourcesAggregator) {
+       zones := w.aggregateZones(resScan, resAggr)
+       utils.KlogDump(1, "After aggregating resources identified zones are", "  ", zones)
+       if !w.args.NoPublish {
+               if err := w.updateNodeResourceTopology(zones); err != nil {
+                       klog.Warningf("cannot update NodeResourceTopology: %s", err.Error())
+               }
+       }
+}
+
+func (w *nfdTopologyUpdater) aggregateZones(resScan resourcemonitor.ResourcesScanner, resAggr resourcemonitor.ResourcesAggregator) v1alpha1.ZoneList {
+       klog.Infof("Scanning")
+       podResources, err := resScan.Scan()
+       utils.KlogDump(1, "podResources are", "  ", podResources)
+       if err != nil {
+               klog.Warningf("Scan failed: %v", err)
+               return nil
+       }
+
+       return resAggr.Aggregate(podResources)
+}
+

We could just move the common logic into new funcs and remove the necessity of having a separate struct.

Contributor Author


I tend to agree that inlining everything might be clearer, but IMVHO decoupling the event-source generator from the main loop has greater benefits:
for example, you can generate your own source of events if needed, without changing the main-loop logic.
This is especially useful for testing, when you want to mock notification events (see the sketch below).
@PiotrProkop WDYT?
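
As a rough illustration of that testing benefit — assuming the illustrative `Event`/`FSUpdate` types from the earlier sketch — a test can act as its own event source, with no real filesystem involved:

```go
package notifier

import (
	"testing"
	"time"
)

// The test injects a synthetic FS event into the channel, exactly as a
// mocked notifier would, so the consuming side can be exercised without
// touching the kubelet state directory.
func TestConsumerReactsToFSEvent(t *testing.T) {
	events := make(chan Event, 1)
	events <- Event{Type: FSUpdate} // mock notification

	select {
	case e := <-events:
		if e.Type != FSUpdate {
			t.Fatalf("expected FSUpdate event, got %v", e.Type)
		}
	case <-time.After(time.Second):
		t.Fatal("timed out waiting for event")
	}
}
```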

Contributor


This makes sense. I have no strong objections against putting the event sources behind this abstraction; I just wanted to point out that it could be simplified. Both approaches work for me 😄

@marquiz
Contributor

marquiz commented Jan 17, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 17, 2023
@Tal-or
Contributor Author

Tal-or commented Jan 23, 2023

Sorry for the long absence; I'll address the comments in the following days.

@Tal-or Tal-or force-pushed the reactive_updates branch 2 times, most recently from 996bfe3 to a5e9b8a Compare January 23, 2023 16:46
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 9, 2023
@Tal-or
Contributor Author

Tal-or commented Feb 14, 2023

@marquiz @PiotrProkop I would like to share my thoughts with you about the continuation of this work.
Basically we have #1049, which supersedes this one.
Once topology-updater has the pfp (pod fingerprint), the reserve plugin on the scheduler side will track the gaps until the next update, so we won't need the reactive update.
My question to you is: do you see any benefit in having the reactive-updates feature anyway?
For example, in case topology-updater runs without the scheduler plugin, or in any other constellation?

@PiotrProkop
Contributor

PiotrProkop commented Feb 14, 2023

@marquiz @PiotrProkop I would like to share my thoughts with you about the continuation of this work. Basically we have #1049, which supersedes this one. Once topology-updater has the pfp (pod fingerprint), the reserve plugin on the scheduler side will track the gaps until the next update, so we won't need the reactive update. My question to you is: do you see any benefit in having the reactive-updates feature anyway? For example, in case topology-updater runs without the scheduler plugin, or in any other constellation?

@Tal-or Not everyone is using the reserve plugin (for us it's the DiscardReservedNodes plugin), and we are using custom patches in NFD for reactive updates, so we are very interested in this work. If you don't have the bandwidth to take care of this PR, I would be happy to take over this work and finish it 😄

@Tal-or
Contributor Author

Tal-or commented Feb 14, 2023

@marquiz @PiotrProkop I would like to share my thoughts with you about the continuation of this work. Basically we have #1049, which supersedes this one. Once topology-updater has the pfp (pod fingerprint), the reserve plugin on the scheduler side will track the gaps until the next update, so we won't need the reactive update. My question to you is: do you see any benefit in having the reactive-updates feature anyway? For example, in case topology-updater runs without the scheduler plugin, or in any other constellation?

@Tal-or Not everyone is using the reserve plugin (for us it's the DiscardReservedNodes plugin), and we are using custom patches in NFD for reactive updates, so we are very interested in this work. If you don't have the bandwidth to take care of this PR, I would be happy to take over this work and finish it 😄

@PiotrProkop I'm very happy to hear that you're interested in this work; I'll continue it in the coming days.

@marquiz
Contributor

marquiz commented Feb 15, 2023

@Tal-or what do you think about the timeline for this? I was thinking of having a more predictable/more frequent release cadence and cutting v0.13.0 around the end of March, releasing whatever we've managed to get in by that time 😊 It would be great to get this in and let people drop some of the custom patches.

@Tal-or
Contributor Author

Tal-or commented Feb 15, 2023

@Tal-or what do you think about the timeline for this? I was thinking of having a more predictable/more frequent release cadence and cutting v0.13.0 around the end of March, releasing whatever we've managed to get in by that time 😊 It would be great to get this in and let people drop some of the custom patches.

Hey @marquiz, I have some time today; let me rebase and address the open comments, and let's see how the next review cycle goes.

@marquiz
Contributor

marquiz commented Feb 15, 2023

Hey @marquiz, I have some time today; let me rebase and address the open comments, and let's see how the next review cycle goes.

Not super urgent, just wanted to give you a heads-up.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 15, 2023
@Tal-or Tal-or force-pushed the reactive_updates branch 2 times, most recently from b75b46f to 7af3de6 Compare February 15, 2023 13:55
@ffromani
Contributor

/cc

Tal-or added 7 commits March 12, 2023 12:37
Enable reactive updates for nfd-topology-updater
by detecting changes in the Kubelet state/checkpoint files,
and signaling the main loop to update the NodeResourceTopology
objects.

This is especially valuable when scaling is a concern:
multiple pods deployed between two consecutive updates
can leave the NRT CRs with stale resource accounting.
Example:
Time Interval = 5s
t0 - New update sent to NRT CRs
t1 - Schedule guaranteed podA
t2 - Schedule guaranteed podB
The time elapsed between t0 and t2 is less than 5 seconds,
IOW the update at t0 is still the most recent one.

At t2 the resource accounting reflected by NRT
is not aligned with the actual accounting, because
the NRT CRs don't reflect the change that happened at t1.

With this reactive update feature we expect an update to be triggered
between t1 and t2, so the NRT objects reflect a more accurate
picture.

There might still be scenarios where the updates
aren't fast enough, but handling those is a
planned future optimization.

The notifier has two event types:
1. Time-based - keeps the old behavior: trigger
an update once per interval.
2. FS event - trigger an update when the Kubelet state/checkpoint files are modified.

Signed-off-by: Talor Itzhak <[email protected]>
On different Kubernetes flavors, like OpenShift for example,
the Kubelet state directory path is different. Make it configurable
for maximum flexibility.

Signed-off-by: Talor Itzhak <[email protected]>
When a message is received via the channel,
the main loop updates the `NodeResourceTopology` objects.

The notifier sends a message via the channel when:
1. The sleep timeout is reached.
2. A change in the Kubelet state files is detected.

Signed-off-by: Talor Itzhak <[email protected]>
Specify the event type as part of the log message.
To reduce the log volume, log it at verbosity level 4 (V(4)).

Signed-off-by: Talor Itzhak <[email protected]>
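
A minimal sketch of what the log line described in the commit above might look like; the function and parameter names are illustrative, not taken from the PR:

```go
package main

import "k8s.io/klog/v2"

func main() {
	logUpdateTrigger("fsUpdate")
}

// logUpdateTrigger records which event type kicked off an update cycle,
// at verbosity 4 so routine triggers don't flood the log.
func logUpdateTrigger(eventType string) {
	klog.V(4).Infof("update triggered by %s event", eventType)
}
```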
This mount is needed for watching the state files

Signed-off-by: Talor Itzhak <[email protected]>
Especially convenient for testing purposes, and
completely harmless.

Signed-off-by: Talor Itzhak <[email protected]>
Signed-off-by: Talor Itzhak <[email protected]>
@Tal-or Tal-or force-pushed the reactive_updates branch from 65882ef to e20cc06 Compare March 12, 2023 11:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 12, 2023
@Tal-or Tal-or force-pushed the reactive_updates branch 3 times, most recently from f261c80 to 2d7890b Compare March 12, 2023 13:08
Tal-or added 3 commits March 16, 2023 11:51
Adding kubelet state directory mount

Signed-off-by: Talor Itzhak <[email protected]>
Access to the kubelet state directory may raise concerns in some setups, so add an option to disable it.
The feature is enabled by default.

Signed-off-by: Talor Itzhak <[email protected]>
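
A hypothetical sketch of such an opt-out option; the actual flag name and wiring in the PR may differ:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag name. Watching the kubelet state dir is on by
	// default and can be disabled where mounting it raises concerns.
	kubeletStateFiles := flag.Bool("kubelet-state-files", true,
		"enable reactive updates based on the Kubelet state/checkpoint files")
	flag.Parse()

	if *kubeletStateFiles {
		fmt.Println("starting fsnotify-based notifier for the kubelet state dir")
	} else {
		fmt.Println("kubelet state dir watching disabled; interval-based updates only")
	}
}
```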
@Tal-or Tal-or force-pushed the reactive_updates branch from 2d7890b to 5c6be58 Compare March 16, 2023 09:53
Contributor

@marquiz marquiz left a comment


Thanks @Tal-or for the persistence on this. I don't have any more feedback and am ready to get this in.

ping @ffromani @PiotrProkop

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 16, 2023
@ffromani
Contributor

Thanks for the ping @marquiz, I'll review shortly.

@PiotrProkop
Contributor

Thanks @Tal-or for the persistence on this. I don't have any more feedback and am ready to get this in.

ping @ffromani @PiotrProkop

LGTM, but I'll leave the final approval to @ffromani. Good work @Tal-or!

@marquiz
Contributor

marquiz commented Mar 16, 2023

LGTM, but I'll leave the final approval to @ffromani.

🥳

Good work @Tal-or!

Yes, really good work on this one!

Contributor

@ffromani ffromani left a comment


/lgtm

Very nice work indeed. Kudos to @Tal-or!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 17, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 65e359b51ea6ae20c384f683aae3e0046e570f40

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, marquiz, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 13f92fa into kubernetes-sigs:master Mar 17, 2023
@marquiz marquiz mentioned this pull request Apr 12, 2023
24 tasks
@PiotrProkop PiotrProkop mentioned this pull request Apr 24, 2023
18 tasks