Incorrect voting of head #5665

Closed
terencechain opened this issue Apr 28, 2020 · 12 comments

@terencechain
Member

terencechain commented Apr 28, 2020

💎 Issue

Background

A growing number of users have raised the concern that validators are incorrectly voting on head. While this could be expected for multiple reasons, it's still worth double-checking and narrowing down the root cause. On average I see my validators vote on the head incorrectly ~10-15% of the time. Whether it's that the local view of the chain is consistently non-canonical (and thus the head is wrong) or something else, I'm opening this up for investigation and discussion.

@Phistr90

I have a validator client with 1 validator running that is constantly voting incorrectly on the head, but it is mostly voting correctly on target and source and almost always fulfilling its attestation duties.

Maybe a time offset is a reason? I just figured out that this system's time is 3s behind.
Maybe a bottleneck in processing power? I see this behavior on an RPi4.
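
For the clock-offset suspicion, one quick way to measure how far the local clock drifts is to compare it against an NTP server. A minimal sketch using the third-party github.com/beevik/ntp package (the server choice and the check itself are my own illustration, not something Prysm does):

```go
package main

import (
	"fmt"
	"log"

	"github.com/beevik/ntp"
)

func main() {
	// Query an NTP server and report how far the local clock is from network time.
	// A persistent offset of ~3s, as described above, shifts every slot deadline
	// the node computes, so blocks can look "late" even when they are not.
	resp, err := ntp.Query("pool.ntp.org")
	if err != nil {
		log.Fatalf("NTP query failed: %v", err)
	}
	fmt.Printf("local clock offset from NTP: %v\n", resp.ClockOffset)
}
```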

@terencechain
Member Author

terencechain commented Apr 28, 2020

Brainstormed a few conditions:
1.) Slow propagation of the beacon block over the network: the block at slot N did not arrive before N + SECONDS_PER_SLOT * 1/3.
2.) Slow processing of the beacon block on the local beacon node: on top of (1), the block at slot N did not get processed before N + SECONDS_PER_SLOT * 1/3.
3.) A bug in fork choice, or fork choice being slow.

I think it's a mixture of 1 and 2. Will collect more data points.
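
To make the deadline in (1) and (2) concrete, here is a minimal timing sketch (assuming SECONDS_PER_SLOT = 12; illustrative only, not Prysm's actual code):

```go
package timing

import "time"

// secondsPerSlot mirrors SECONDS_PER_SLOT (assumed to be 12s here).
const secondsPerSlot = 12 * time.Second

// attestationDeadline returns the point by which the block for `slot` must be
// received and processed for attesters to vote on it as head: one third of the
// way into the slot.
func attestationDeadline(genesis time.Time, slot uint64) time.Time {
	slotStart := genesis.Add(time.Duration(slot) * secondsPerSlot)
	return slotStart.Add(secondsPerSlot / 3)
}

// arrivedInTime reports whether a block received at `arrival` beat that deadline.
// If it did not, attesters at this slot vote for the previous head instead, which
// shows up as an "incorrect head" vote once the late block becomes canonical.
func arrivedInTime(genesis time.Time, slot uint64, arrival time.Time) bool {
	return !arrival.After(attestationDeadline(genesis, slot))
}
```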

@torfbolt

@Phistr90 this sounds a lot like the issues described in #5247, which I had in the previous testnet. Do you get similar error messages? Maybe there is some kind of bottleneck or inefficient algorithm for arm64; I was also running on ARM.

@terencechain
Member Author

terencechain commented Apr 29, 2020

Time offset is one of the root causes here. See: https://discordapp.com/channels/476244492043812875/667761857209040916/704908725185544192

That said, I'm not satisfied with it as the only root cause. Will dig more into this today.

@terencechain
Member Author

This is the reason: https://hackmd.io/dVbmIMHNQ6aby77g0-ME8A?both

Still working on it

@terencechain
Member Author

As mentioned in the previous comment, beacon block propagation is the root cause here. This is not a protocol issue. The question we need to answer here is: why do some blocks arrive later than 6s?

Will leave this open and monitor throughout the week.

@Phistr90

Phistr90 commented May 3, 2020

It keeps getting worse. Now my validators on my Intel NUC are also mostly (about 70%, varying by validator) voting on incorrect heads. My internet connection is stable and has high bandwidth. It was perfectly fine on my Intel NUC for the first 1.5 weeks.

Upping my max peer count from the default to 60 now.

@terencechain
Member Author

@Phistr90 the issue here is not from your node; it's the rest of the nodes your node is peered with.

What do you see for the following metric?

@Phistr90

Phistr90 commented May 3, 2020

@terencechain Which metric?

@terencechain
Member Author

> @terencechain Which metric?

The block_arrival_latency_milliseconds histogram.

Reference: #5708
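
For context, a histogram of this shape is typically registered and observed with the Prometheus Go client roughly as in the sketch below. This is illustrative only, not Prysm's actual code; the bucket boundaries are taken from the output in the next comment, and observeBlockArrival is a hypothetical hook:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// blockArrivalLatency mirrors the block_arrival_latency_milliseconds metric:
// it records how long after the start of a block's slot the block was received.
var blockArrivalLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "block_arrival_latency_milliseconds",
	Help:    "Captures blocks propagation time. Blocks arrival in milliseconds distribution",
	Buckets: []float64{1000, 2000, 3000, 4000, 5000, 6000},
})

// observeBlockArrival is a hypothetical hook called when a block is received.
func observeBlockArrival(slotStart time.Time) {
	blockArrivalLatency.Observe(float64(time.Since(slotStart).Milliseconds()))
}
```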

@Phistr90

Phistr90 commented May 3, 2020

```
# HELP block_arrival_latency_milliseconds Captures blocks propagation time. Blocks arrival in milliseconds distribution
# TYPE block_arrival_latency_milliseconds histogram
block_arrival_latency_milliseconds_bucket{le="1000"} 1
block_arrival_latency_milliseconds_bucket{le="2000"} 9
block_arrival_latency_milliseconds_bucket{le="3000"} 91
block_arrival_latency_milliseconds_bucket{le="4000"} 113
block_arrival_latency_milliseconds_bucket{le="5000"} 133
block_arrival_latency_milliseconds_bucket{le="6000"} 141
block_arrival_latency_milliseconds_bucket{le="+Inf"} 149
block_arrival_latency_milliseconds_sum 483051
block_arrival_latency_milliseconds_count 149
# HELP p2p_peer_count The number of peers in a given state.
# TYPE p2p_peer_count gauge
p2p_peer_count{state="Bad"} 199
p2p_peer_count{state="Connected"} 11
p2p_peer_count{state="Connecting"} 0
p2p_peer_count{state="Disconnected"} 404
p2p_peer_count{state="Disconnecting"} 0
# HELP p2p_topic_peer_count The number of peers subscribed to a given topic.
# TYPE p2p_topic_peer_count gauge
p2p_topic_peer_count{topic="/eth2/f071c66c/attester_slashing/ssz_snappy"} 11
p2p_topic_peer_count{topic="/eth2/f071c66c/beacon_aggregate_and_proof/ssz_snappy"} 11
p2p_topic_peer_count{topic="/eth2/f071c66c/beacon_block/ssz_snappy"} 11
p2p_topic_peer_count{topic="/eth2/f071c66c/committee_index0_beacon_attestation/ssz_snappy"} 3
p2p_topic_peer_count{topic="/eth2/f071c66c/committee_index1_beacon_attestation/ssz_snappy"} 5
p2p_topic_peer_count{topic="/eth2/f071c66c/committee_index2_beacon_attestation/ssz_snappy"} 4
p2p_topic_peer_count{topic="/eth2/f071c66c/committee_index3_beacon_attestation/ssz_snappy"} 2
p2p_topic_peer_count{topic="/eth2/f071c66c/committee_index5_beacon_attestation/ssz_snappy"} 4
p2p_topic_peer_count{topic="/eth2/f071c66c/proposer_slashing/ssz_snappy"} 11
p2p_topic_peer_count{topic="/eth2/f071c66c/voluntary_exit/ssz_snappy"} 11

I needed to rebuild and restart. I will collect more data, but it looks like I am getting a lot of late blocks. I also noticed it takes a really long time for my node to connect to more than a handful of peers.
Is there a way to put a metadata field 'number of hops' in the data packet?
Also, I am located in Germany.
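
Reading the cumulative buckets above, the share of late blocks can be computed directly; for example, blocks later than 4s are the total count minus the le="4000" bucket. A minimal sketch with the counts hard-coded from the output above:

```go
package main

import "fmt"

func main() {
	// Cumulative bucket counts copied from the histogram output above:
	// upper bound (ms) -> number of blocks that arrived within that bound.
	bounds := []float64{1000, 2000, 3000, 4000, 5000, 6000}
	cumulative := []float64{1, 9, 91, 113, 133, 141}
	total := 149.0 // block_arrival_latency_milliseconds_count

	for i, b := range bounds {
		late := total - cumulative[i]
		fmt.Printf("blocks later than %5.0f ms: %3.0f (%.0f%%)\n", b, late, 100*late/total)
	}
	// For example, later than 4000 ms: 149 - 113 = 36 blocks, roughly 24% of all blocks seen.
}
```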

@terencechain
Member Author

A lot of late blocks indeed. 10% of my blocks are 4s+, and I consider that to be bad. Preston's 4s+ blocks are at 5%, and his internet is much faster than mine. @Phistr90 can you open a feature request for 'number of hops'?

I'm going to close this issue and open subsequent issues to track late blocks.

What we learned from this issue:
1.) In Prysm, when a validator misses a vote on "head", it's solely because of a late block received on the beacon node side. There's no protocol bug. This was confirmed via the metric, and one can manually change the validator wait time to 6s before submitting the attestation; the % of correct head votes will increase (see the sketch after this list).
2.) This could be an infra setup limitation from nodes launched in a cluster behind a cloud provider, or a relay node issue. Will need further investigation.
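
As a rough illustration of the wait-time experiment in (1), the validator-side delay before attesting looks like the sketch below (illustrative only, not Prysm's actual code; the delay constant is the knob being changed):

```go
package validator

import (
	"context"
	"time"
)

// attestationDelay is how far into the slot the validator submits its attestation.
// The spec value is one third of a slot (4s with 12s slots); raising it toward 6s,
// as described above, gives late blocks more time to arrive and increases the
// fraction of correct head votes at the cost of attesting later.
const attestationDelay = 4 * time.Second // try 6 * time.Second for the experiment

// waitThenAttest sleeps until attestationDelay into the slot, then attests to
// whatever the beacon node currently considers the head.
func waitThenAttest(ctx context.Context, slotStart time.Time, attest func(context.Context)) {
	select {
	case <-time.After(time.Until(slotStart.Add(attestationDelay))):
		attest(ctx)
	case <-ctx.Done():
	}
}
```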
