Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add histogram metrics for block arrival time #5708

Merged
merged 19 commits into from
May 2, 2020

Conversation

terencechain
Copy link
Member

@terencechain terencechain commented May 1, 2020

What type of PR is this?

Feature

What does this PR do? Why is it needed?
Pre read: https://hackmd.io/dVbmIMHNQ6aby77g0-ME8A

This PR adds useful metrics to capture beacon blocks propagation time on arrival. It only captures once per slot so not too bad.
I think this will be really helpful to roll out to cluster for observing network stability

blocks_arrival_in_seconds_total_bucket{le="1"} 3
blocks_arrival_in_seconds_total_bucket{le="2"} 15
blocks_arrival_in_seconds_total_bucket{le="3"} 20
blocks_arrival_in_seconds_total_bucket{le="4"} 28
blocks_arrival_in_seconds_total_bucket{le="5"} 29
blocks_arrival_in_seconds_total_bucket{le="6"} 29
blocks_arrival_in_seconds_total_bucket{le="+Inf"} 29
blocks_arrival_in_seconds_total_sum 79

Which issues(s) does this PR fix?

Helped to debug #5665

@terencechain terencechain requested a review from a team as a code owner May 1, 2020 16:38
@terencechain terencechain self-assigned this May 1, 2020
@prestonvanloon
Copy link
Member

Why not use a histogram metric? https://prometheus.io/docs/concepts/metric_types/

@terencechain
Copy link
Member Author

Why not use a histogram metric? https://prometheus.io/docs/concepts/metric_types/

I didn't know about histogram. Will look into this

@terencechain terencechain marked this pull request as draft May 1, 2020 16:42
@codecov
Copy link

codecov bot commented May 1, 2020

Codecov Report

Merging #5708 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #5708   +/-   ##
=======================================
  Coverage   48.24%   48.24%           
=======================================
  Files         241      241           
  Lines       21656    21656           
=======================================
  Hits        10449    10449           
  Misses       9334     9334           
  Partials     1873     1873           

@terencechain terencechain changed the title Add metrics for block arrival time Add histogram metrics for block arrival time May 1, 2020
@terencechain terencechain marked this pull request as ready for review May 1, 2020 17:41
prometheus.HistogramOpts{
Name: "block_arrival_latency_milliseconds",
Help: "Captures blocks propagation time. Blocks arrival in milliseconds distribution",
Buckets: []float64{1, 2, 3, 4, 5, 6},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to update these buckets to be milliseconds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. Nice catch. I just fixed it.

Really need ☕ ....

@terencechain
Copy link
Member Author

Want to use UnixNano? That would be more accurate.

great suggestion!

@prestonvanloon prestonvanloon mentioned this pull request May 1, 2020
}
diff := uint64(roughtime.Now().UnixNano() - startTime.UnixNano())

// Denominate everything in milliseconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would actually be microseconds. If you use the time API as suggested above, it’s more clear

1 microsecond is 1000ns.

if err != nil {
return err
}
diff := uint64(roughtime.Now().UnixNano() - startTime.UnixNano())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using the time API?

diffMs := startTime.Sub(roughtime.Now()) / time.Milliseconds

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh nice. Getting a crash course on time pkg today...

I think it's suppose to be roughtime.Now().Sub(startTime)

if err != nil {
return err
}
diffMs := roughtime.Now().Sub(startTime) / time.Millisecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is backwards and you’ll have negative numbers. Did you confirm at runtime?

@prestonvanloon
Copy link
Member

LGTM, tested locally and here's what I got after a few slots.

block_arrival_latency_milliseconds_bucket{le="1000"} 6
block_arrival_latency_milliseconds_bucket{le="2000"} 8
block_arrival_latency_milliseconds_bucket{le="3000"} 10
block_arrival_latency_milliseconds_bucket{le="4000"} 10
block_arrival_latency_milliseconds_bucket{le="5000"} 10
block_arrival_latency_milliseconds_bucket{le="6000"} 10
block_arrival_latency_milliseconds_bucket{le="+Inf"} 10
block_arrival_latency_milliseconds_sum 12299
block_arrival_latency_milliseconds_count 10

@prestonvanloon prestonvanloon merged commit 4296127 into master May 2, 2020
@delete-merged-branch delete-merged-branch bot deleted the arrival-time-metrics branch May 2, 2020 02:35
@terencechain
Copy link
Member Author

LGTM, tested locally and here's what I got after a few slots.

block_arrival_latency_milliseconds_bucket{le="1000"} 6
block_arrival_latency_milliseconds_bucket{le="2000"} 8
block_arrival_latency_milliseconds_bucket{le="3000"} 10
block_arrival_latency_milliseconds_bucket{le="4000"} 10
block_arrival_latency_milliseconds_bucket{le="5000"} 10
block_arrival_latency_milliseconds_bucket{le="6000"} 10
block_arrival_latency_milliseconds_bucket{le="+Inf"} 10
block_arrival_latency_milliseconds_sum 12299
block_arrival_latency_milliseconds_count 10

Oh wow. You don't get many delayed blocks compare to mine...

blocks_arrival_in_seconds_total_bucket{le="1"} 3
blocks_arrival_in_seconds_total_bucket{le="2"} 15
blocks_arrival_in_seconds_total_bucket{le="3"} 20
blocks_arrival_in_seconds_total_bucket{le="4"} 28
blocks_arrival_in_seconds_total_bucket{le="5"} 29
blocks_arrival_in_seconds_total_bucket{le="6"} 29

@prestonvanloon
Copy link
Member

Looking forward to monitoring over time, but yeah I see most blocks arriving in the first 2 buckets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants