
core/tracker: implement analyse cluster participation #846

Merged
merged 8 commits into main from dhruv/participation on Jul 25, 2022

Conversation

@dB2510 (Contributor) commented Jul 23, 2022

Implements analyseParticipation and a participation reporter that calculate cluster participation from the partial signatures exchanged between peers, and instruments the results.

category: feature
ticket: #821
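
A rough sketch of the core idea, assembled from the snippets reviewed below (illustrative, not the merged code verbatim): analyseParticipation scans a duty's events for partial-signature activity and maps it to the share indexes of participating peers, which the reporter then logs and instruments.

func analyseParticipation(events []event) (map[int]bool, error) {
	resp := make(map[int]bool)
	for _, e := range events {
		// A parSigDBInternal event means the current node participated;
		// a parSigEx event means the peer with e.shareIdx participated.
		if e.component == parSigEx || e.component == parSigDBInternal {
			if e.shareIdx == 0 {
				return nil, errors.New("shareIdx empty", z.Any("component", e.component))
			}
			resp[e.shareIdx] = true
		}
	}
	return resp, nil
}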

@codecov (bot) commented Jul 23, 2022

Codecov Report

Merging #846 (eb45b19) into main (7e5431c) will decrease coverage by 0.45%.
The diff coverage is 59.61%.

@@            Coverage Diff             @@
##             main     #846      +/-   ##
==========================================
- Coverage   55.30%   54.85%   -0.46%     
==========================================
  Files         111      112       +1     
  Lines       11539    11666     +127     
==========================================
+ Hits         6382     6399      +17     
- Misses       4240     4328      +88     
- Partials      917      939      +22     
Impacted Files               Coverage Δ
p2p/peer.go                  48.14% <0.00%>  (-3.86%) ⬇️
core/tracker/tracker.go      70.40% <62.00%> (-3.97%) ⬇️
core/qbft/qbft.go            71.67% <0.00%>  (-10.31%) ⬇️
app/app.go                   54.86% <0.00%>  (-4.74%) ⬇️
core/aggsigdb/memory.go      79.62% <0.00%>  (-0.93%) ⬇️
core/priority/calculate.go   70.78% <0.00%>  (ø)
core/dutydb/memory.go        71.91% <0.00%>  (+1.27%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e5431c...eb45b19.

"github.com/prometheus/client_golang/prometheus/promauto"
)

var participationGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
Contributor:

Namespace: core
Subsystem: tracker
Name: duty_participation (or just participation)

Align labels to "duty", "peer" and "pubkey".
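
A minimal sketch of what that suggestion implies (hypothetical wiring; note a later comment asks to drop pubkey from the analysis):

var participationGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "core",
	Subsystem: "tracker",
	Name:      "participation",
	Help:      "Set to 1 if the peer participated successfully in the given duty, else 0",
}, []string{"duty", "peer", "pubkey"})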

@@ -42,11 +46,15 @@ const (
	sentinel
)

// shareIdx represents a peer's share index.
type shareIdx int
Contributor:

Rather not introduce this type; a plain var shareIdx int is what we use throughout the rest of the codebase.

Contributor (Author):

Yeah, I introduced this type to avoid any confusion between plain int and shareIdx, since shareIdx is used very extensively in this file.

}

// newParticipationReporter returns a new participation reporter function which logs and instruments peer participation.
func newParticipationReporter(peers []p2p.Peer) func(context.Context, core.Duty, map[core.PubKey]map[shareIdx]bool, map[core.PubKey]map[shareIdx]bool) {
Contributor:

Internalise state by keeping the last reported state inside this function; do not require it to be kept outside.

var absentPeers []string
for _, peer := range peers {
	// Peer index is 0 indexed while shareIdx is 1 indexed.
	if dvPeers[shareIdx(peer.Index+1)] {
Contributor:

Maybe add a ShareIdx method to Peer.
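
With such a method (one is added to p2p.Peer later in this PR), the call site no longer needs the manual +1. An illustrative version of the loop, assuming dvPeers is keyed by 1-indexed share index:

var absentPeers []string
for _, peer := range peers {
	// ShareIdx returns peer.Index+1, since share indexes are 1-indexed.
	if !dvPeers[peer.ShareIdx()] {
		absentPeers = append(absentPeers, peer.Name)
	}
}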

}

// Avoid spamming from identical logs.
if len(absentPeers) > 0 && !reflect.DeepEqual(currentParticipation[pubKey], lastParticipation[pubKey]) {
Contributor:

Best not to use reflect in production code.

You should also log when some peers didn't participate last time but all participated this time, e.g. "all peers participated".

Note this is still going to spam the logs if we have 1000 validators in a cluster.

// event represents an event emitted by a core workflow component.
type event struct {
	duty      core.Duty
	component component
	pubkey    core.PubKey
	shareIdx  shareIdx
Contributor:

Make a note that this is an optional field only set for some component events, and list them.

	eventsPerDV[e.pubkey] = append(eventsPerDV[e.pubkey], e)
}

for pubKey, dvEvents := range eventsPerDV {
Contributor:

I'm not sure about including pubkey in the whole analysis. Since we group pubkeys by duty, either all the pubkeys that should be there for a peer will be present or none will, so it doesn't add any value. I would say we should exclude pubkey from this: just do duty and peer, not pubkey.

Namespace: "core",
Subsystem: "tracker",
Name: "participation",
Help: "Set to 1 if peer participated successfully for the given duty and Distributed Validator public key or else 0",
Contributor:

Remove the pubkey stuff from the help text.

@@ -47,6 +50,7 @@ type event struct {
	duty      core.Duty
	component component
	pubkey    core.PubKey
	shareIdx  int // This is an optional field only set by parSigEx and parSigDBInternal events.
Contributor:

and validatorapi

Contributor:

Also maybe add: ShareIdx is 1 indexed, so 0 indicates unset.
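
Folding both suggestions into the field comment, it might read:

	shareIdx int // Optional, only set by validatorapi, parSigEx and parSigDBInternal events. 1-indexed, so 0 indicates unset.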

@@ -46,6 +46,11 @@ type Peer struct {
	Name string
}

// ShareIdx returns share index of this Peer.
func (p Peer) ShareIdx() int {
	return p.Index + 1
Contributor:

Add a comment noting that it is 1-indexed.
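
i.e. something like:

// ShareIdx returns the 1-indexed share index of this peer.
// Peer indexes are 0-indexed, so this is Index+1; 0 indicates unset.
func (p Peer) ShareIdx() int {
	return p.Index + 1
}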

t.failedDutyReporter(duty, failed, failedComponent.String(), failedMsg)

// TODO(dhruv): Case of cluster participation
// analyseParticipation(duty, t.events[duty])
currentParticipation := analyseParticipation(t.events[duty])
Contributor (@corverroos) commented Jul 25, 2022:

nit: participatedShares := ...

for _, e := range events {
	// If we get a parSigDBInternal event, then the current node participated successfully.
	// If we get a parSigEx event, then the corresponding peer with e.shareIdx participated successfully.
	if e.shareIdx > 0 && (e.component == parSigEx || e.component == parSigDBInternal) {
Contributor:

Rather be defensive: return an error if shareIdx is 0 for the expected components.

// newParticipationReporter returns a new participation reporter function which logs and instruments peer participation.
func newParticipationReporter(peers []p2p.Peer) func(context.Context, core.Duty, map[int]bool) {
	// lastParticipation is the set of peers who participated in the last duty.
	lastParticipation := make(map[int]bool)
Contributor:

nit: prevAbsent := make(...)

Contributor:

Can just make this a []string and compare with fmt.Sprint(absent) != fmt.Sprint(prevAbsent).

	// lastParticipation is the set of peers who participated in the last duty.
	lastParticipation := make(map[int]bool)

	return func(ctx context.Context, duty core.Duty, currentParticipation map[int]bool) {
Contributor:

nit: participatedShares

}

// Avoid identical logs if the same set of peers didn't participate.
if len(absentPeers) > 0 && !uniqueParticipation {
Contributor:

if fmt.Sprint(absent) != fmt.Sprint(prevAbsent) {
	if len(absent) == 0 {
		log.Info(ctx, "All peers participated in duty", z.Any("duty", duty))
	} else {
		log.Info(ctx, "Not all peers participated in duty", z.Any("duty", duty), z.Any("absent", absent))
	}
}

// analyseParticipation(duty, t.events[duty])
participatedShares, err := analyseParticipation(t.events[duty])
if err != nil {
	log.Error(ctx, "Invalid participated shares", err)
Contributor:

BTW, you need to add the tracker log topic to ctx.
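
Presumably via charon's log helper where the tracker loop gets its context; the exact call site here is an assumption:

	ctx = log.WithTopic(ctx, "tracker")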

Contributor (Author):

done

// If we get a parSigEx event, then the corresponding peer with e.shareIdx participated successfully.
if e.component == parSigEx || e.component == parSigDBInternal {
	if e.shareIdx == 0 {
		return nil, errors.New("invalid shareIdx", z.Any("component", e.component))
Contributor:

"shareIdx empty"

@dB2510 added the "merge when ready" label (indicates bulldozer bot may merge when all checks pass) on Jul 25, 2022
@obol-bulldozer bot merged commit 8f25c30 into main on Jul 25, 2022
@obol-bulldozer bot deleted the dhruv/participation branch on Jul 25, 2022 at 14:25