
core/bcast: add recaster to rebroadcast registrations every epoch #1008

Merged
merged 5 commits into main on Aug 22, 2022

Conversation

corverroos (Contributor) commented Aug 20, 2022

Adds the recaster component to rebroadcast builder registrations every epoch.

category: feature
ticket: #1009
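
To make the shape of the change concrete, here is a minimal sketch of the idea. The types and field names below are illustrative placeholders, not the actual charon core types: the recaster caches the latest signed builder registration per validator pubkey and replays the whole cache to its subscribers at the start of every epoch.

```go
package bcast

import (
	"context"
	"sync"
)

// Placeholder types standing in for the real core types.
type (
	PubKey string
	Duty   struct{ Slot int64 }
	Data   []byte // A signed builder registration.
)

// regTuple pairs a registration with the duty (slot) it was created for.
type regTuple struct {
	duty Duty
	data Data
}

// Recaster caches the latest registration per pubkey and rebroadcasts
// the whole cache at the start of every epoch.
type Recaster struct {
	mu     sync.Mutex
	tuples map[PubKey]regTuple
	subs   []func(context.Context, Duty, PubKey, Data) error
}

// NewRecaster returns a new recaster with an empty cache.
func NewRecaster() *Recaster {
	return &Recaster{tuples: make(map[PubKey]regTuple)}
}

// Subscribe registers a function to receive rebroadcast registrations.
func (r *Recaster) Subscribe(fn func(context.Context, Duty, PubKey, Data) error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.subs = append(r.subs, fn)
}
```

The later sketches in this thread reuse these placeholder names.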

codecov bot commented Aug 20, 2022

Codecov Report

Merging #1008 (c558417) into main (d0121eb) will decrease coverage by 0.71%.
The diff coverage is 23.36%.

@@            Coverage Diff             @@
##             main    #1008      +/-   ##
==========================================
- Coverage   54.25%   53.54%   -0.72%     
==========================================
  Files         117      119       +2     
  Lines       13130    13292     +162     
==========================================
- Hits         7124     7117       -7     
- Misses       4983     5146     +163     
- Partials     1023     1029       +6     
| Impacted Files | Coverage Δ |
| --- | --- |
| core/bcast/recast.go | 0.00% <0.00%> (ø) |
| core/interfaces.go | 0.00% <0.00%> (ø) |
| core/types.go | 32.81% <0.00%> (-3.71%) ⬇️ |
| core/scheduler/scheduler.go | 73.36% <66.66%> (+0.46%) ⬆️ |
| app/app.go | 58.75% <100.00%> (-0.11%) ⬇️ |
| core/scheduler/metrics.go | 100.00% <100.00%> (ø) |
| core/qbft/qbft.go | 71.67% <0.00%> (-10.31%) ⬇️ |
| core/leadercast/transport.go | 75.14% <0.00%> (-1.19%) ⬇️ |
| cmd/bootnode.go | 31.72% <0.00%> (-0.45%) ⬇️ |

... and 6 more


ciaranmcveigh5 (Contributor) commented Aug 20, 2022

Ran it in the ropsten cluster. Registrations appear to be rebroadcast each epoch as intended and the beacon node is keeping the validators in its registered pool. The only strange behaviour was with the /teku_proposer_config endpoint:

2022-08-20 15:54:19.686 FATAL - Failed to load proposer config from: http://node0:3600/teku_proposer_config
tech.pegasys.teku.infrastructure.exceptions.InvalidConfigurationException: Failed to load proposer config from: http://node0:3600/teku_proposer_config
at tech.pegasys.teku.validator.client.proposerconfig.loader.ProposerConfigLoader.getProposerConfig(ProposerConfigLoader.java:42) ~[teku-validator-client-develop.jar:22.8.0+66-gff99b3f]
at tech.pegasys.teku.validator.client.proposerconfig.UrlProposerConfigProvider.internalGetProposerConfig(UrlProposerConfigProvider.java:37) ~[teku-validator-client-develop.jar:22.8.0+66-gff99b3f]
at tech.pegasys.teku.validator.client.proposerconfig.AbstractProposerConfigProvider.lambda$getProposerConfig$0(AbstractProposerConfigProvider.java:74) ~[teku-validator-client-develop.jar:22.8.0+66-gff99b3f]
at tech.pegasys.teku.infrastructure.async.SafeFuture.of(SafeFuture.java:81) ~[teku-infrastructure-async-develop.jar:22.8.0+66-gff99b3f]
at tech.pegasys.teku.infrastructure.async.AsyncRunner.lambda$runAsync$2(AsyncRunner.java:38) ~[teku-infrastructure-async-develop.jar:22.8.0+66-gff99b3f]
at tech.pegasys.teku.infrastructure.async.SafeFuture.of(SafeFuture.java:73) ~[teku-infrastructure-async-develop.jar:22.8.0+66-gff99b3f]
at tech.pegasys.teku.infrastructure.async.ScheduledExecutorAsyncRunner.lambda$createRunnableForAction$1(ScheduledExecutorAsyncRunner.java:119) ~[teku-infrastructure-async-develop.jar:22.8.0+66-gff99b3f]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: com.fasterxml.jackson.databind.exc.ValueInstantiationException: Cannot construct instance of `tech.pegasys.teku.spec.datastructures.eth1.Eth1Address`, problem: Bytes20 should be 20 bytes, but was 0 bytes.
 at [Source: (URL); line: 1, column: 139] (through reference chain: tech.pegasys.teku.validator.client.ProposerConfig["proposer_config"]->java.util.LinkedHashMap["0x805debe30e3e942e8b9b172a45b226b3b3eecfcfd63fde314003521e4d87dd35214ea44343c1151d6914184b59367d4b"]->tech.pegasys.teku.validator.client.ProposerConfig$Config["fee_recipient"])
 ...
 Failed to load proposer config from: http://node0:3600/teku_proposer_config

I also get a lot of these in the logs:

20:32:55.619 ERRO sched      Rebroadcast duty error (will retry next epoch): failed to submit validator registration: failed to call POST endpoint: Post "http://teku.ropsten.svc.cluster.local:5051/eth/v1/validator/register_validator": context deadline exceeded {"duty": "0/builder_registration", "slot": 591648}
20:32:55.620 ERRO sched      Rebroadcast duty error (will retry next epoch): failed to submit validator registration: failed to call POST endpoint: Post "http://teku.ropsten.svc.cluster.local:5051/eth/v1/validator/register_validator": context deadline exceeded {"duty": "0/builder_registration", "slot": 591360}
20:32:55.620 ERRO sched      Rebroadcast duty error (will retry next epoch): failed to submit validator registration: failed to call POST endpoint: Post "http://teku.ropsten.svc.cluster.local:5051/eth/v1/validator/register_validator": context deadline exceeded {"duty": "0/builder_registration", "slot": 591872}
20:32:55.620 ERRO sched      Rebroadcast duty error (will retry next epoch): failed to submit validator registration: failed to call POST endpoint: Post "http://teku.ropsten.svc.cluster.local:5051/eth/v1/validator/register_validator": context deadline exceeded {"duty": "0/builder_registration", "slot": 591904}

(image attached)

The length should be 100. The drop from 96 to 81 is at the end of an epoch, so some validators must have failed their re-broadcast in 3 consecutive epochs.

corverroos (Contributor Author) commented:

That teku error seems to be due to an empty "fee_recipient" being returned by the teku_proposer_config endpoint, probably because it is empty in the cluster lock.

wrt "failed to submit validator registration", see the beacon API latency metrics. It seems like the beacon node is timing out after 2s. Strange that the register endpoint is so slow?

defer r.mu.Unlock()

tuple, ok := r.tuples[pubkey]
if ok && tuple.duty.Slot >= duty.Slot {
Contributor:

should this be a ||?

Contributor Author:
nope
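
For readers skimming the thread, a hedged sketch (using the placeholder names from the earlier sketch, not the merged code) of why `&&` is the right operator here: the new registration should be dropped only when an entry already exists and that entry is at least as new; with `||`, any existing entry would block every later registration for that pubkey, even newer ones.

```go
// Store caches a registration, keeping only the newest one per pubkey.
func (r *Recaster) Store(pubkey PubKey, duty Duty, data Data) {
	r.mu.Lock()
	defer r.mu.Unlock()

	tuple, ok := r.tuples[pubkey]
	// Skip only if we already hold an equal-or-newer registration.
	// With `||`, ok == true alone would be enough to drop the update,
	// so a pubkey could never be refreshed once cached.
	if ok && tuple.duty.Slot >= duty.Slot {
		return
	}

	r.tuples[pubkey] = regTuple{duty: duty, data: data}
}
```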

r.mu.Lock()
for k, v := range r.tuples {
	clonedTuples[k] = v
	clonedSubs = append(clonedSubs, r.subs...)
xenowits (Contributor) commented Aug 22, 2022:
couldn't understand why cloning subs inside the loop

Contributor Author:
good catch

Contributor Author:
@ciaranmcveigh5 this might have caused some of the latency issues you saw, since we bombarded the BN with duplicate rebroadcasts.
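
A hedged sketch of the corrected rebroadcast path (same placeholder names as above; the real function name and signature may differ): the subscriber slice is cloned once, outside the tuple loop, so each subscriber sees each cached registration exactly once per epoch.

```go
// rebroadcast snapshots the cache and subscribers under the lock,
// then broadcasts outside it.
func (r *Recaster) rebroadcast(ctx context.Context) error {
	var (
		clonedTuples = make(map[PubKey]regTuple)
		clonedSubs   []func(context.Context, Duty, PubKey, Data) error
	)

	r.mu.Lock()
	for k, v := range r.tuples {
		clonedTuples[k] = v
	}
	// Clone subs once, outside the loop. Cloning inside the loop appended
	// a full copy of the subscriber list per cached tuple, which is what
	// multiplied the registrations sent to the beacon node.
	clonedSubs = append(clonedSubs, r.subs...)
	r.mu.Unlock()

	for pubkey, t := range clonedTuples {
		for _, sub := range clonedSubs {
			if err := sub(ctx, t.duty, pubkey, t.data); err != nil {
				return err
			}
		}
	}

	return nil
}
```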

s.scheduleSlot(slotCtx, slot)
}
}
}

func (s *Scheduler) emitCoreSlot(ctx context.Context, slot core.Slot) {
Contributor:
add godoc
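
Something along these lines would do (wording is only a suggestion, not the comment that was merged):

```go
// emitCoreSlot notifies all registered slot subscribers that the given
// core slot has started.
```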


// FirstInEpoch returns true if this is the first slot in the epoch.
func (s Slot) FirstInEpoch() bool {
	return s.Slot%s.SlotsPerEpoch == 0
Contributor:
Do we assume that s.Slot is always a positive integer? Because this won't work for -1.

Contributor Author:
-1 would return false and that is correct no?

Contributor:
Yeah, it is correct. The test case I had in mind was epoch -1, i.e. slots from -32 to -1.

Contributor Author:
not sure that is a valid test case
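
For completeness, a quick self-contained check of the modulo behaviour (placeholder Slot type, not the real one; Go's % keeps the sign of the dividend, so even hypothetical negative slots behave consistently):

```go
package main

import "fmt"

// slotInfo is a trimmed-down stand-in for the real Slot type, keeping only
// the fields FirstInEpoch needs.
type slotInfo struct {
	Slot          int64
	SlotsPerEpoch int64
}

// FirstInEpoch returns true if this is the first slot in the epoch.
func (s slotInfo) FirstInEpoch() bool {
	return s.Slot%s.SlotsPerEpoch == 0
}

func main() {
	for _, slot := range []int64{0, 1, 31, 32, -1, -32} {
		s := slotInfo{Slot: slot, SlotsPerEpoch: 32}
		fmt.Printf("slot %d: FirstInEpoch=%v\n", slot, s.FirstInEpoch())
	}
	// Output:
	// slot 0: FirstInEpoch=true
	// slot 1: FirstInEpoch=false
	// slot 31: FirstInEpoch=false
	// slot 32: FirstInEpoch=true
	// slot -1: FirstInEpoch=false
	// slot -32: FirstInEpoch=true
}
```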

@corverroos added the "merge when ready" label (indicates bulldozer bot may merge when all checks pass) on Aug 22, 2022
@obol-bulldozer bot merged commit 5e0575a into main on Aug 22, 2022
@obol-bulldozer bot deleted the corver/recast branch on August 22, 2022 at 15:49