This repository has been archived by the owner on Jan 13, 2025. It is now read-only.
DR6 performance issues #7753
Problem
Our performance in the recent DryRun6 was bad: we were unhealthy during most of the 'Ramp TPS' rounds. At the moment I'm at a loss as to what caused us (Staking Facilities) to miss so many slots.
I believe our machine isn't the problem (32 cores, 96 GB RAM, NVMe storage, 3x 2080Ti). I ran benchmarks today (v0.21.5) and was able to squeeze out 150k max TPS with almost 90k sustained average TPS. No additional software was running on that machine, and I stopped RPC'ing the node early on.
That leaves peering, latency, or other networking issues as the potential cause. Our machine is co-located in an Equinix DC with a 100 Mbit connection which, according to Equinix, can burst without limit.
Our log files: Google Drive
Let's look at epoch 82 (a Ramp TPS round). We were scheduled for 64 slots and missed 22 (see epoch82.log). I think 8 of the missed slots can be attributed to issue #7588.
But our validator also missed quite a lot of its slots on its own and contributed to the above problem. A typical pattern seems to be that it produces 1-2 slots and then misses the other 3-4. This indicates that we were timed out by the next leader. Why? And what can we do about it?
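To put the epoch 82 numbers in proportion, here is a minimal sketch of the arithmetic. The three inputs are the figures quoted above; the count attributed to #7588 is my estimate rather than a measured value, and the function name is only illustrative.

```python
def skip_summary(scheduled: int, missed: int, attributed: int) -> dict:
    """Split missed leader slots into explained and unexplained shares."""
    unexplained = missed - attributed
    return {
        "skip_rate": missed / scheduled,            # share of all scheduled slots
        "unexplained": unexplained,                 # skips not covered by the estimate
        "unexplained_rate": unexplained / scheduled,
    }


# Figures from epoch 82: 64 scheduled, 22 missed, 8 estimated for #7588.
summary = skip_summary(scheduled=64, missed=22, attributed=8)
print(f"overall skip rate: {summary['skip_rate']:.1%}")
print(f"unexplained skips: {summary['unexplained']} "
      f"({summary['unexplained_rate']:.1%} of scheduled slots)")
```

So even with the estimate for #7588 subtracted, 14 of 64 scheduled slots remain unexplained.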
Two examples from epoch82.log:
312848 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U
312849 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312850 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312851 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312516 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U
312517 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U
312518 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312519 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
Looking at the validator log file, can you tell what went wrong in those two examples?
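To check how common the "produce a few, then skip the rest" pattern is across the whole epoch, a small script along these lines could summarise each leader window. It is only a sketch and rests on two assumptions: every line has the form shown above (slot, identity, optional `SKIPPED`), and leader slots come in runs of 4 aligned to multiples of 4, which holds for both examples. The names `WINDOW`, `parse` and `window_patterns` are made up for illustration.

```python
from collections import defaultdict

WINDOW = 4  # assumption: consecutive leader slots per window, as in both examples
LEADER = "55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U"

LOG = """\
312848 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U
312849 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312850 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312851 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312516 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U
312517 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U
312518 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
312519 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U SKIPPED
"""


def parse(lines):
    """Yield (slot, leader, skipped) for each well-formed log line."""
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # ignore blank or unrelated lines
        skipped = len(parts) > 2 and parts[2] == "SKIPPED"
        yield int(parts[0]), parts[1], skipped


def window_patterns(lines, leader):
    """Map each window's first slot to a pattern string.

    'o' = produced, 'x' = skipped, '.' = slot not present in the log.
    """
    windows = defaultdict(dict)
    for slot, who, skipped in parse(lines):
        if who == leader:
            windows[slot // WINDOW][slot % WINDOW] = skipped
    patterns = {}
    for index, slots in sorted(windows.items()):
        patterns[index * WINDOW] = "".join(
            "." if pos not in slots else ("x" if slots[pos] else "o")
            for pos in range(WINDOW)
        )
    return patterns


for first_slot, pattern in window_patterns(LOG.splitlines(), LEADER).items():
    print(first_slot, pattern)
```

Run against the full epoch82.log, a pattern that starts with `o` and ends in `x` marks a window that began well and was then cut short, while `xxxx` marks a window that never started.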