Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand metrics for errors and warnings #218

Closed
wants to merge 16 commits into from

Conversation

jmcph4
Copy link
Member

@jmcph4 jmcph4 commented Oct 4, 2023

Description

Closes #213.

Notes & open questions

This PR expands the metrics collected by the Discv5 service to include errors and warnings that occur throughout the operation of the service. This is achieved by the introduction of a new type, ErrorMetrics, which stores four elements:

  • Total number of errors
  • Total number of warnings
  • Mapping of specific errors to their associated counts
  • Mapping of specific warnings to their associated counts

A few technical notes:

  • Cumulative counts are stored in addition to individual counts to avoid having to load each AtomicUsize in a manner such as self.errors.read().values().map(|at| at.load(Ordering::Relaxed).sum(). The additional memory overhead is considered negligible as is the added locking overhead
  • Errors and warnings are uniquely identified solely by strings. Presently, these are provided by named constants within the metrics module. Assigning unique numeric identifiers was considered but would add complexity with little benefit (these will eventually be turned back into some string representation when these hit Grafana/Prometheus, anyway).

Notes for reviewers:

  • I'm very open to the names of these error/warning strings changing to be more descriptive/clear, but once again, they are still largely arbitrary
  • With regards to testing, these changes are particularly challenging to test as they exclusively target failure modes of the discovery service. Manual testing, by executing the various examples in the repository, was performed in order to check for basic defects (such as panics, deadlocks, fatal errors, etc.). Additionally, I plumbed this branch into Lighthouse mainline and tested by polling the metrics endpoint (this won't yield any of the metrics in this PR as they need to be added to LH separately, but this nonetheless validates that the discovery service is functioning correctly even with the changes in this PR on a live BN).

Change checklist

  • Self-review
  • Documentation updates if relevant (by way of doc comments)
  • Tests if relevant

@jmcph4 jmcph4 self-assigned this Oct 4, 2023
@jmcph4 jmcph4 added enhancement New feature or request work-in-progress Work in progress labels Oct 4, 2023
@jmcph4 jmcph4 added ready-for-review and removed work-in-progress Work in progress labels Oct 9, 2023
@jmcph4
Copy link
Member Author

jmcph4 commented Oct 9, 2023

Discussed this out-of-band with @AgeManning and we agreed to pivot the design of how Discv5 metrics are to work. As such, this PR is going through a refactor.

@jmcph4 jmcph4 added work-in-progress Work in progress and removed ready-for-review labels Oct 9, 2023
@divagant-martian
Copy link
Collaborator

divagant-martian commented Nov 28, 2023

What was the outcome of keeping per-crate metrics in lighthouse instead? what's the future of this pr in light of that outcome?

@AgeManning
Copy link
Member

I believe the state should be like this:

We keep this PR. We add metrics to discv5 (and to libp2p) that gives us a count of errors and their severity. I.e number of crits, errors, warns.
In conjunction we implement: sigp/lighthouse#4954 which allows us to store seperate log files for discv5 and libp2p.

The idea then would be that we can check our metrics dash to know if there are any underlying problems and we can then go to the log files to investigate further.

This PR needs to simply count the error logs.

@AgeManning
Copy link
Member

AgeManning commented Nov 29, 2023

A simple way of doing this, it to replace the warn!, error! and crit! macros with ones that increment a metric count. No further code would need to change.

Perhaps @armaganyildirak may be interested in this.

@divagant-martian
Copy link
Collaborator

I expressed this via other channels but will leave this here for the record.

I don't think this is the right approach to solve this issue. If we end up doing this it should be as a last resort since doing this is highly polluting to the code and very hard to maintain. I proposed a different approach: to build a tracing::Subscriber that records metrics in LH itself.

Since I have not seen this implemented yet, I'll give it a go and report back.

In case I find the proposed approach not doable or equally hard to maintain, we could continue with providing these metrics from discv5. However, this should be done with macros, per log level, and ignoring the message of the log to reduce the polluting and maintenance work. (@AgeManning 's last comment, basically) Both scenarios diverge enough from the current PR to still consider it reasonable to build on top of this one even in the second scenario.

This being said, I'm closing this PR. Second scenario should start a new one if that's where we land

@divagant-martian
Copy link
Collaborator

alternative here: sigp/lighthouse#4979 it ended up being fairly simple

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request work-in-progress Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Error Metric
3 participants