app: push readyz in background #867

Merged
merged 7 commits into main from dhruv/callreadyz
Jul 29, 2022
Conversation

Contributor

@dB2510 dB2510 commented Jul 27, 2022

This change updates the readyz metric every second in the background. Previously the metric was updated only when someone called the readyz endpoint, so if no one called it, the readyz Grafana panel showed the node as inactive.

category: bug
ticket: #880

@@ -73,6 +73,24 @@ func wireMonitoringAPI(life *lifecycle.Manager, addr string, localNode *enode.Lo
life.RegisterStart(lifecycle.AsyncBackground, lifecycle.StartMonitoringAPI, httpServeHook(server.ListenAndServe))
life.RegisterStop(lifecycle.StopMonitoringAPI, lifecycle.HookFunc(server.Shutdown))

go func() {
ticker := time.NewTicker(time.Second)
Contributor Author

We can discuss the interval: what is the right duration between metric updates?

Contributor

Metrics are not pushed, they are pulled. So there is no need to update more often than the pull period, which is 15s. Suggest making this 10s.
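A minimal sketch of the suggestion above, assuming a 10s update interval that stays below the 15s pull period. The `gauge` type is a stand-in written for this example, not the Prometheus client type, and `refreshGauge`, `gaugeAfterOneTick`, and `updateInterval` are hypothetical names. The loop takes its ticks from a channel so it can be driven by a real ticker in production or by hand in a test.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// updateInterval is kept below the 15s pull period mentioned in the review.
const updateInterval = 10 * time.Second

// gauge is a stand-in for a metrics gauge (assumption, not a real client type).
type gauge struct{ v int64 }

func (g *gauge) Set(x int64) { atomic.StoreInt64(&g.v, x) }
func (g *gauge) Get() int64  { return atomic.LoadInt64(&g.v) }

// refreshGauge re-evaluates check on every tick until ctx is cancelled.
func refreshGauge(ctx context.Context, ticks <-chan time.Time, g *gauge, check func() bool) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticks:
			if check() {
				g.Set(1)
			} else {
				g.Set(0)
			}
		}
	}
}

// gaugeAfterOneTick drives the loop with one manual tick and returns the gauge value.
func gaugeAfterOneTick(ready bool) int64 {
	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan time.Time)
	done := make(chan struct{})
	var g gauge
	go func() {
		defer close(done)
		refreshGauge(ctx, ticks, &g, func() bool { return ready })
	}()
	ticks <- time.Now() // unbuffered send: returns once the loop has taken the tick
	cancel()
	<-done // the loop has finished handling the tick and exited
	return g.Get()
}

func main() {
	fmt.Println("gauge after one tick:", gaugeAfterOneTick(true))
	// In production the ticks would come from time.NewTicker(updateInterval).C.
	fmt.Println("production interval:", updateInterval)
}
```

Updating more often than the scraper reads only adds load, since intermediate values are never observed.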

case <-ctx.Done():
return
case <-ticker.C:
syncing, err := beaconNodeSyncing(ctx, eth2Cl)
Contributor Author

avoided logging errors here, since that would spam the logs

@@ -73,6 +73,24 @@ func wireMonitoringAPI(life *lifecycle.Manager, addr string, localNode *enode.Lo
life.RegisterStart(lifecycle.AsyncBackground, lifecycle.StartMonitoringAPI, httpServeHook(server.ListenAndServe))
life.RegisterStop(lifecycle.StopMonitoringAPI, lifecycle.HookFunc(server.Shutdown))

go func() {
Contributor
@corverroos corverroos Jul 27, 2022

suggest extracting a function isReady(ctx) bool which you call here and in newReadyHandler

Contributor
@corverroos corverroos Jul 27, 2022

Also add a test for isReady please 🙏

Contributor

Maybe we should ONLY have an async "isReady" resolver, and then readyz just reads the latest value..?
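One way the "readyz just reads the latest value" idea might look is sketched below. It assumes a background checker exposes its cached result through a `func() error` getter, similar in spirit to the `readyErrFunc` that appears later in this PR. The `statusFor` helper, the response bodies, and the 500 status code are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errReadyUninit is the value a checker could report before its first run.
var errReadyUninit = errors.New("ready check uninitialised")

// newReadyHandler serves the most recent readiness result. readyErrFunc is
// assumed to be cheap: it only returns a value cached by a background checker.
func newReadyHandler(readyErrFunc func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		// Call the getter once and reuse the result for status and body.
		if err := readyErrFunc(); err != nil {
			http.Error(w, "not ready: "+err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	}
}

// statusFor reports the HTTP status the handler returns for a cached result.
func statusFor(cached error) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
	newReadyHandler(func() error { return cached })(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("uninitialised:", statusFor(errReadyUninit))
	fmt.Println("ready:", statusFor(nil))
}
```

With this shape the handler never performs beacon node or peer checks itself, so request latency does not depend on them.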

codecov bot commented Jul 27, 2022

Codecov Report

Merging #867 (ced872e) into main (2237124) will decrease coverage by 0.03%.
The diff coverage is 58.73%.

❗ Current head ced872e differs from pull request most recent head b8a97a0. Consider uploading reports for the commit b8a97a0 to get more accurate results

@@            Coverage Diff             @@
##             main     #867      +/-   ##
==========================================
- Coverage   54.73%   54.69%   -0.04%     
==========================================
  Files         113      114       +1     
  Lines       12079    12265     +186     
==========================================
+ Hits         6611     6708      +97     
- Misses       4505     4587      +82     
- Partials      963      970       +7     
Impacted Files                        Coverage Δ
app/app.go                            56.01% <0.00%> (-0.70%) ⬇️
app/monitoringapi.go                  66.92% <59.67%> (+37.89%) ⬆️
core/signeddata.go                    41.66% <0.00%> (-2.05%) ⬇️
cmd/createcluster.go                  49.43% <0.00%> (-1.69%) ⬇️
testutil/beaconmock/beaconmock.go     62.65% <0.00%> (-1.55%) ⬇️
p2p/discovery.go                      47.88% <0.00%> (-1.39%) ⬇️
core/leadercast/transport.go          75.14% <0.00%> (-1.19%) ⬇️
core/tracker/tracker.go               59.76% <0.00%> (-0.72%) ⬇️
testutil/beaconmock/options.go        43.49% <0.00%> (-0.42%) ⬇️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2237124...b8a97a0.

)
for {
select {
case <-ctx.Done():
return ctx.Err()
case res := <-results:
if res.Error != nil {
continue
errCount++
Contributor Author

Added an error counter to return an error if we are unable to ping a quorum of peers.

case <-ctx.Done():
return
case <-ticker.Chan():
mu.Lock()
Contributor

remember to decrease mutex locked scope to absolute minimum, and remember to NEVER do IO (disk/network) calls while a lock is held.
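The locking advice above can be sketched as follows, assuming a small `readyState` type invented for this example. The possibly slow check runs with no lock held, and the mutex is taken only for the assignment. `readDuringCheck` is a hypothetical helper that reads the state from inside the check, which would deadlock if `refresh` held the mutex while calling it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errReadyUninit = errors.New("ready check uninitialised")

// readyState holds the latest readiness result (hypothetical type).
type readyState struct {
	mu  sync.Mutex
	err error
}

func newReadyState() *readyState { return &readyState{err: errReadyUninit} }

// get returns the cached result; the lock is held only for the read.
func (s *readyState) get() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.err
}

// refresh runs check (which may do network IO) with no lock held,
// then takes the lock only for the assignment.
func (s *readyState) refresh(check func() error) {
	err := check()
	s.mu.Lock()
	s.err = err
	s.mu.Unlock()
}

// readDuringCheck reads the state from inside check. This would deadlock
// if refresh held the mutex while calling check.
func readDuringCheck() (during, after error) {
	s := newReadyState()
	s.refresh(func() error {
		during = s.get()
		return nil
	})
	return during, s.get()
}

func main() {
	during, after := readDuringCheck()
	fmt.Println("during check:", during, "| after check:", after)
}
```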

Comment on lines 162 to 163
actual int
errCount int
Contributor

suggest: okCount, errCount
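A sketch of how `okCount` and `errCount` could work together, under the assumption that each peer yields exactly one result and that the quorum size is passed in by the caller. The PR's actual quorum rule and function names are not shown in this conversation, so `awaitQuorum`, `pingAll`, and `errNoQuorum` are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errNoQuorum is a hypothetical sentinel for this example.
var errNoQuorum = errors.New("quorum of peers unreachable")

// awaitQuorum reads one result per peer. It returns nil as soon as quorum
// pings succeeded, and errNoQuorum as soon as too many failed for the
// quorum to still be reachable.
func awaitQuorum(ctx context.Context, results <-chan error, total, quorum int) error {
	var okCount, errCount int
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case err := <-results:
			if err != nil {
				errCount++
			} else {
				okCount++
			}
			if okCount >= quorum {
				return nil
			}
			if errCount > total-quorum {
				return errNoQuorum
			}
		}
	}
}

// pingAll feeds fixed outcomes (true = ping succeeded) into awaitQuorum.
func pingAll(outcomes []bool, quorum int) error {
	results := make(chan error, len(outcomes))
	for _, ok := range outcomes {
		if ok {
			results <- nil
		} else {
			results <- errors.New("ping failed")
		}
	}
	return awaitQuorum(context.Background(), results, len(outcomes), quorum)
}

func main() {
	fmt.Println(pingAll([]bool{true, false, true}, 2))
	fmt.Println(pingAll([]bool{false, true, false}, 2))
}
```

Counting both outcomes lets the function return early in either direction instead of waiting for the context to expire.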


// We wrap the Advance() calls with blockers to make sure that the ticker
// can go to sleep and produce ticks without time passing in parallel.
clock.BlockUntil(1)
Contributor

BlockUntil only works for calls to Sleep and After, not Timer.Chan(), so you'll need another way.

}
}

func TestStartCheckerPingFail(t *testing.T) {
Contributor

lots of duplication in these tests, can't we make it a table test rather?

Contributor Author

yeah sure
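For illustration, a table-driven shape for these checks might look like the sketch below. It assumes the readiness decision can be isolated in a pure function, here the hypothetical `readiness`, and it compares against error sentinels. The error messages are taken from the diff in this PR; the `testCase` type and `failures` helper are invented for the example. In a real `_test.go` file the loop body would sit inside `t.Run(tc.name, ...)`.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errReadySyncing = errors.New("beacon node not synced")
	errReadyNoPing  = errors.New("couldn't ping all peers")
)

// readiness is a hypothetical pure decision function for this example.
func readiness(syncing bool, pingErr error) error {
	if syncing {
		return errReadySyncing
	}
	if pingErr != nil {
		return errReadyNoPing
	}
	return nil
}

// testCase is one row of the table.
type testCase struct {
	name    string
	syncing bool
	pingErr error
	want    error
}

var cases = []testCase{
	{name: "ready", want: nil},
	{name: "syncing", syncing: true, want: errReadySyncing},
	{name: "ping fail", pingErr: errors.New("timeout"), want: errReadyNoPing},
	{name: "syncing wins over ping fail", syncing: true, pingErr: errors.New("timeout"), want: errReadySyncing},
}

// failures runs every case and returns the names of the ones that fail.
func failures() []string {
	var failed []string
	for _, tc := range cases {
		if got := readiness(tc.syncing, tc.pingErr); !errors.Is(got, tc.want) {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failing cases:", failures())
}
```

Adding a scenario then means adding one row rather than copying a whole test function.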

readyErr = errors.New("beacon node not synced")
readyzGauge.Set(0)
} else if peersReady(ctx, peerIDs, tcpNode) != nil {
readyErr = errors.New("couldn't ping all peers")
Contributor

suggest using error sentinels (var errReadyNoPing = errors.New("blah blah")) and use them for comparison in tests

readyErrFunc := startReadyChecker(ctx, tcpNode, eth2Cl, peerIDs, clockwork.NewRealClock())
mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
readyErr := readyErrFunc()
if readyErr != nil {
Contributor

can inline

Contributor Author
@dB2510 dB2510 Jul 29, 2022

Yeah, but I avoided this because the handler needs to send the error string in the response, which would mean calling readyErrFunc() twice and acquiring the mutex again.

func startReadyChecker(ctx context.Context, tcpNode host.Host, eth2Cl eth2client.NodeSyncingProvider, peerIDs []peer.ID, clock clockwork.Clock) func() error {
var (
mu sync.Mutex
readyErr error
Contributor

suggest initialising the error with a non-nil value, e.g. errReadyUninit = "ready check uninitialised"

Comment on lines 118 to 120
mu.Lock()
readyErr = errReadySyncing
mu.Unlock()
Contributor

rather do locking in one place.

suggest:

		... } else if syncing {
					err = errReadySyncing
					readyzGauge.Set(0)
		} ...
		
		mu.Lock()
		readyErr = err
		mu.Unlock()

@dB2510 dB2510 added the merge when ready Indicates bulldozer bot may merge when all checks pass label Jul 29, 2022
@dB2510 dB2510 linked an issue Jul 29, 2022 that may be closed by this pull request
@obol-bulldozer obol-bulldozer bot merged commit a17bbb4 into main Jul 29, 2022
@obol-bulldozer obol-bulldozer bot deleted the dhruv/callreadyz branch July 29, 2022 15:36
Development

Successfully merging this pull request may close these issues.

Update readyz gauge asynchronously
2 participants