Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add discovery duration logging #1055

Merged

Conversation

VillePihlava
Copy link
Contributor

@VillePihlava VillePihlava commented Feb 10, 2023

This pull request adds logging of discovery durations. There are some optional things mentioned in #899 by @eero-t. I had a quick discussion with @marquiz and if I understood correctly something like what is implemented here could also be useful. Depending on what exactly is wanted, this could close #899. This is related to work done in #1050.

The logging output can look like this:

<output from other sources>
I0213 12:42:32.317317 2204667 nfd-worker.go:199] discovery duration for "system" source: 472.156µs
I0213 12:42:32.318310 2204667 usb.go:150] discovered usb features:
I0213 10:51:09.882921 2197292 usb.go:150] <many lines of output>
I0213 12:42:32.318726 2204667 nfd-worker.go:199] discovery duration for "usb" source: 1.397359ms
I0213 12:42:32.318736 2204667 nfd-worker.go:203] feature discovery for 9 sources lasted for: 6s
W0213 12:42:32.318745 2204667 nfd-worker.go:205] feature discovery sources took over half (6s) of sleep interval (10s)
I0213 12:42:32.318755 2204667 nfd-worker.go:561] starting feature discovery...

Fixes #899

@netlify
Copy link

netlify bot commented Feb 10, 2023

Deploy Preview for kubernetes-sigs-nfd ready!

Name Link
🔨 Latest commit b1c6b22
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-nfd/deploys/63ea17be5be97a00096d114b
😎 Deploy Preview https://deploy-preview-1055--kubernetes-sigs-nfd.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @VillePihlava. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 10, 2023
Copy link
Contributor

@marquiz marquiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @VillePihlava, this is a nice improvement. The KlogDump prints sources in deterministic order which is nice.

What we'd like is to have similar information for hooks run in getFeaturesFromGooks( in source/loca/local.go) because those are the most likely source of "surprises" (even if they are deprecated). This can be addressed in a separate PR, though

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2023
}
utils.KlogDump(3, "discovery durations for feature sources:", " ", durations)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of using intermediate map & printing things only after whole loop, instead of just printing header before loop, moving Infof() printing after Discover(), and adding time to that log item?

Copy link

@eero-t eero-t Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the deterministic order comment by @marquiz, current Infof() printing within the loop has higher logging priority than this, so it will be printed always when this is, right? I think it will be confusing to have items printed twice, but in different order... IMHO it's better to join the messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a good point 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the Infof before Discover and added logging after which includes the duration.

@eero-t
Copy link

eero-t commented Feb 10, 2023

Depending on what exactly is wanted, this could close #899

@VillePihlava While this could be a step into direction requested in #899, it does not yet do any of the suggested 3 items.

#899 requires deducting from the sleep used between discovery rounds, the time already spent on that interval from core.sleepInterval value, so that discoveries actually happen on the specified interval (or using time ticker to handle that, as suggested by @marquiz in #899).

EDIT: I hadn't noticed 2101cb2 going in, which actually adds the ticker usage. That's the PR that should have closed #899, as extra warnings / better logging was optional part of that ticket.

@eero-t
Copy link

eero-t commented Feb 10, 2023

To finish #899 (now that 2101cb2 is merged), this should check (also) the total time used for a discovery round, and e.g. log an error for discovery rounds that exceed sleep interval, and/or a warning if total (+ any individual discoveries?) is (say) >50% of it.

I'm not sure which one is more important; an error telling that discoveries, or discovery interval, is not working, or an early warning of approaching the limit. Because that is easy to fix just by increasing core.sleepInterval, I do not think these messages would need rate-limiting.

@VillePihlava
Copy link
Contributor Author

To finish #899 (now that 2101cb2 is merged), this should check (also) the total time used for a discovery round, and e.g. log an error for discovery rounds that exceed sleep interval, and/or a warning if total (+ any individual discoveries?) is (say) >50% of it.

I'm not sure which one is more important; an error telling that discoveries, or discovery interval, is not working, or an early warning of approaching the limit. Because that is easy to fix just by increasing core.sleepInterval, I do not think these messages would need rate-limiting.

I'll look into this. Also added a link to the other pull request to avoid confusion in the original message.

@marquiz
Copy link
Contributor

marquiz commented Feb 10, 2023

To finish #899 (now that 2101cb2 is merged), this should check (also) the total time used for a discovery round, and e.g. log an error for discovery rounds that exceed sleep interval, and/or a warning if total (+ any individual discoveries?) is (say) >50% of it.

We could log a warning. I wouldn't tie this to the sleep interval but put some reasonable hard limit, e.g. the discovery really shouldn't take more than 1 or 2 seconds in any scenario, I think. If you put sleep interval to 1 hour and the discovery takes 20 minutes something is really badly wrong (IMHO)

@eero-t
Copy link

eero-t commented Feb 10, 2023

If you put sleep interval to 1 hour and the discovery takes 20 minutes something is really badly wrong (IMHO)

Why? That would be a deliberate user decision. It could just be something naturally slow, but not very taxing on resources (maybe not RFC 2549, but something).

@ArangoGutierrez
Copy link
Contributor

If you put sleep interval to 1 hour and the discovery takes 20 minutes something is really badly wrong (IMHO)

Why? That would be a deliberate user decision. It could just be something naturally slow, but not very taxing on resources (maybe not RFC 2549, but something).

I agree with @marquiz , discovery usually takes seconds, we should aim for lowering the foot print of NFD, allowing minutes won't move us in that direction

Aside from that, this PR looks good to me

@eero-t
Copy link

eero-t commented Feb 10, 2023

I agree with @marquiz , discovery usually takes seconds, we should aim for lowering the foot print of NFD, allowing minutes won't move us in that direction

I think short enough default sleepInterval, and giving the warning much earlier (e.g. 10% of interval), would work better than using a fixed time limit for when warning is given.

If user increases the interval from default, he's likely to have a good reason for it.

@ArangoGutierrez
Copy link
Contributor

I agree with @marquiz , discovery usually takes seconds, we should aim for lowering the foot print of NFD, allowing minutes won't move us in that direction

I think short enough default sleepInterval, and giving the warning much earlier (e.g. 10% of interval), would work better than using a fixed time limit for when warning is given.

If user increases the interval from default, he's likely to have a good reason for it.

Umm Maybe for custom Hooks, but we are deprecating hooks, if user wants to discover something that they know takes time, is better for them to use a side car that publishes to the NodeFeature CR, than to increasing the discovery time toleration on NFD core

@marquiz
Copy link
Contributor

marquiz commented Feb 10, 2023

If user increases the interval from default, he's likely to have a good reason for it.

I think the usual reason is something else than preparing for that the nfd-worker is churning feature discovery for ages. As @ArangoGutierrez noted hooks are being deprecated anyway so I wouldn't spend too much time on this. The built-in discovery should be in the order of milliseconds (sub-second anyway) so anything longer than a few seconds is raising concnerns

@VillePihlava VillePihlava force-pushed the discovery-duration-logging branch from 3d0fc26 to 0019029 Compare February 13, 2023 08:44
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 13, 2023
@VillePihlava VillePihlava force-pushed the discovery-duration-logging branch from 0019029 to 5d24ecc Compare February 13, 2023 08:50
Copy link
Contributor

@marquiz marquiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @VillePihlava, looks good to me.

ping @eero-t

@VillePihlava
Copy link
Contributor Author

The warning is displayed if the duration for all sources exceeds 5 seconds. I updated the initial comment for an example of the output.

Copy link
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Unhold at will @marquiz
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 35e9e6348bf33928f53c48a24144b74bb9499179

@VillePihlava VillePihlava force-pushed the discovery-duration-logging branch from 5d24ecc to e558c89 Compare February 13, 2023 09:28
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2023
@VillePihlava
Copy link
Contributor Author

Noticed that the log levels were different, I changed both to 2 now

@eero-t
Copy link

eero-t commented Feb 13, 2023

As @ArangoGutierrez noted hooks are being deprecated anyway so I wouldn't spend too much time on this. The built-in discovery should be in the order of milliseconds (sub-second anyway) so anything longer than a few seconds is raising concnerns

@marquiz unfortunately you both missed the point. If interval is 1 second, doing warning only when discovery actually takes >2 seconds is not really useful. Warning should be logged before the whole interval is used.

Comment on lines 204 to 205
if discoveryDuration > 5*time.Second {
klog.Warningf("feature discovery for sources lasted for more than 5s (%v)", discoveryDuration)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning at 5s is not useful when the discovery interval is smaller. I would go for something like this (untested!):

Suggested change
if discoveryDuration > 5*time.Second {
klog.Warningf("feature discovery for sources lasted for more than 5s (%v)", discoveryDuration)
if discoveryDuration > core.sleepInterval/2 {
klog.Warningf("Feature discovery sources took over half (%v) of sleep interval (%v)", discoveryDuration, core.sleepInterval)

(Other alternative would be having measurement and warning like this where the sleep is done instead.)

PS. commit message is missing "Fixes:" tag for #899.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VillePihlava please address, I agree with @eero-t

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eero-t PTAL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now changed. A fixes tag is not allowed in the commit message in this repository apparently, the original comment has it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added a check to see if the duration is positive (negative durations should not cause a warning)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, looks fine now!

}

discoveryDuration := time.Since(discoveryStart)
klog.V(2).Infof("feature discovery for sources lasted for: %v", discoveryDuration)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For improved log output amount control, per-source logging could be V(3), and this V(2), with count included to it:

Suggested change
klog.V(2).Infof("feature discovery for sources lasted for: %v", discoveryDuration)
klog.V(2).Infof("feature discovery for %d sources lasted for: %v", len(w.featureSources), discoveryDuration)

(Just a suggestion, feel free to ignore.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VillePihlava please address, I agree with @eero-t

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eero-t PTAL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has now been changed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eero-t PTAL

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine now!

@VillePihlava VillePihlava force-pushed the discovery-duration-logging branch 2 times, most recently from 9e5479b to 714d3d6 Compare February 13, 2023 10:48
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Feb 13, 2023
@VillePihlava VillePihlava force-pushed the discovery-duration-logging branch from 714d3d6 to 52813e9 Compare February 13, 2023 10:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Feb 13, 2023
@VillePihlava VillePihlava force-pushed the discovery-duration-logging branch from 52813e9 to b1c6b22 Compare February 13, 2023 10:58
@eero-t
Copy link

eero-t commented Feb 13, 2023

/lgtm

@k8s-ci-robot
Copy link
Contributor

@eero-t: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 29770799c7e6c844416c4b5ff46965da637981e0

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ArangoGutierrez, marquiz, VillePihlava

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@marquiz
Copy link
Contributor

marquiz commented Feb 13, 2023

Looks like everybody consent on merging
/unhold
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 13, 2023
@k8s-ci-robot k8s-ci-robot merged commit f4c6d59 into kubernetes-sigs:master Feb 13, 2023
@marquiz marquiz mentioned this pull request Apr 12, 2023
24 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve "core.sleepInterval" handling
5 participants