-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add discovery duration logging #1055
Add discovery duration logging #1055
Conversation
✅ Deploy Preview for kubernetes-sigs-nfd ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
Hi @VillePihlava. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @VillePihlava, this is a nice improvement. The KlogDump prints sources in deterministic order which is nice.
What we'd like is to have similar information for hooks run in getFeaturesFromGooks
( in source/loca/local.go
) because those are the most likely source of "surprises" (even if they are deprecated). This can be addressed in a separate PR, though
pkg/nfd-worker/nfd-worker.go
Outdated
} | ||
utils.KlogDump(3, "discovery durations for feature sources:", " ", durations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the advantage of using intermediate map & printing things only after whole loop, instead of just printing header before loop, moving Infof() printing after Discover(), and adding time to that log item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the deterministic order comment by @marquiz, current Infof() printing within the loop has higher logging priority than this, so it will be printed always when this is, right? I think it will be confusing to have items printed twice, but in different order... IMHO it's better to join the messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a good point 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the Infof
before Discover
and added logging after which includes the duration.
@VillePihlava While this could be a step into direction requested in #899, it does not yet do any of the suggested 3 items. #899 requires deducting from the sleep used between discovery rounds, the time already spent on that interval from core.sleepInterval value, so that discoveries actually happen on the specified interval (or using time ticker to handle that, as suggested by @marquiz in #899). EDIT: I hadn't noticed 2101cb2 going in, which actually adds the ticker usage. That's the PR that should have closed #899, as extra warnings / better logging was optional part of that ticket. |
To finish #899 (now that 2101cb2 is merged), this should check (also) the total time used for a discovery round, and e.g. log an error for discovery rounds that exceed sleep interval, and/or a warning if total (+ any individual discoveries?) is (say) >50% of it. I'm not sure which one is more important; an error telling that discoveries, or discovery interval, is not working, or an early warning of approaching the limit. Because that is easy to fix just by increasing |
I'll look into this. Also added a link to the other pull request to avoid confusion in the original message. |
We could log a warning. I wouldn't tie this to the sleep interval but put some reasonable hard limit, e.g. the discovery really shouldn't take more than 1 or 2 seconds in any scenario, I think. If you put sleep interval to 1 hour and the discovery takes 20 minutes something is really badly wrong (IMHO) |
Why? That would be a deliberate user decision. It could just be something naturally slow, but not very taxing on resources (maybe not RFC 2549, but something). |
I agree with @marquiz , discovery usually takes seconds, we should aim for lowering the foot print of NFD, allowing minutes won't move us in that direction Aside from that, this PR looks good to me |
I think short enough default If user increases the interval from default, he's likely to have a good reason for it. |
Umm Maybe for custom Hooks, but we are deprecating hooks, if user wants to discover something that they know takes time, is better for them to use a side car that publishes to the NodeFeature CR, than to increasing the discovery time toleration on NFD core |
I think the usual reason is something else than preparing for that the nfd-worker is churning feature discovery for ages. As @ArangoGutierrez noted hooks are being deprecated anyway so I wouldn't spend too much time on this. The built-in discovery should be in the order of milliseconds (sub-second anyway) so anything longer than a few seconds is raising concnerns |
3d0fc26
to
0019029
Compare
0019029
to
5d24ecc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @VillePihlava, looks good to me.
ping @eero-t
The warning is displayed if the duration for all sources exceeds 5 seconds. I updated the initial comment for an example of the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Unhold at will @marquiz
/hold
LGTM label has been added. Git tree hash: 35e9e6348bf33928f53c48a24144b74bb9499179
|
5d24ecc
to
e558c89
Compare
Noticed that the log levels were different, I changed both to 2 now |
@marquiz unfortunately you both missed the point. If interval is 1 second, doing warning only when discovery actually takes >2 seconds is not really useful. Warning should be logged before the whole interval is used. |
pkg/nfd-worker/nfd-worker.go
Outdated
if discoveryDuration > 5*time.Second { | ||
klog.Warningf("feature discovery for sources lasted for more than 5s (%v)", discoveryDuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning at 5s is not useful when the discovery interval is smaller. I would go for something like this (untested!):
if discoveryDuration > 5*time.Second { | |
klog.Warningf("feature discovery for sources lasted for more than 5s (%v)", discoveryDuration) | |
if discoveryDuration > core.sleepInterval/2 { | |
klog.Warningf("Feature discovery sources took over half (%v) of sleep interval (%v)", discoveryDuration, core.sleepInterval) |
(Other alternative would be having measurement and warning like this where the sleep is done instead.)
PS. commit message is missing "Fixes:" tag for #899.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VillePihlava please address, I agree with @eero-t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eero-t PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now changed. A fixes tag is not allowed in the commit message in this repository apparently, the original comment has it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added a check to see if the duration is positive (negative durations should not cause a warning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, looks fine now!
pkg/nfd-worker/nfd-worker.go
Outdated
} | ||
|
||
discoveryDuration := time.Since(discoveryStart) | ||
klog.V(2).Infof("feature discovery for sources lasted for: %v", discoveryDuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For improved log output amount control, per-source logging could be V(3), and this V(2), with count included to it:
klog.V(2).Infof("feature discovery for sources lasted for: %v", discoveryDuration) | |
klog.V(2).Infof("feature discovery for %d sources lasted for: %v", len(w.featureSources), discoveryDuration) |
(Just a suggestion, feel free to ignore.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VillePihlava please address, I agree with @eero-t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eero-t PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has now been changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eero-t PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine now!
9e5479b
to
714d3d6
Compare
714d3d6
to
52813e9
Compare
52813e9
to
b1c6b22
Compare
/lgtm |
@eero-t: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
LGTM label has been added. Git tree hash: 29770799c7e6c844416c4b5ff46965da637981e0
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ArangoGutierrez, marquiz, VillePihlava The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Looks like everybody consent on merging |
This pull request adds logging of discovery durations. There are some optional things mentioned in #899 by @eero-t. I had a quick discussion with @marquiz and if I understood correctly something like what is implemented here could also be useful. Depending on what exactly is wanted, this could close #899. This is related to work done in #1050.
The logging output can look like this:
Fixes #899