Paravalidators authoring blocks but missing all votes #4991
Comments
I've been tracking this type of bug down. Some failure modes that we have been observing, in order of impact:
I'm not excluding the possibility of some other elusive bug. Can you provide us with a validator that you control where this happened, plus all the logs you can provide around that time (the more the merrier)? This should help us understand whether it is one of the above or maybe something else. |
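For operators gathering logs, a time-bounded extract around the incident is usually the most useful artifact. A minimal sketch, assuming the node runs under systemd as a unit named `polkadot` (the unit name and timestamps are placeholders):

```bash
# Export all node logs around the incident window; adjust unit name and times.
journalctl -u polkadot \
  --since "2024-07-10 08:00:00" --until "2024-07-10 12:00:00" \
  --no-pager > polkadot-incident.log
```

If the node can be restarted ahead of the next occurrence, increasing verbosity for the parachain subsystems (for example with `--log parachain=debug`) makes the resulting logs considerably more useful.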
I've seen this "automatically resolving" on some nodes without manual interaction. Is there an automatic PeerId renewal somehow, without renewing the network secret or restarting the node?
I'll ask Saxemberg, who's currently having these kinds of issues 🙏 Thanks for looking into it |
No, the PeerID can't change without a restart. I'm not sure about the public IP, since that really depends on your setup. Providing the full polkadot startup command line would be of great help as well. |
I can tell that in my case it was not restarted, and an IP change can't happen (own /26 subnet in my colocation rack). The startup command was pretty basic (see the representative sketch below).
And it happened on the latest release. |
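The exact flags used above were not preserved in this thread; a minimal sketch of a typical validator invocation, with the name, chain, and base path as placeholders, might look like:

```bash
# Hypothetical minimal validator startup; adjust name, chain and base path to your setup.
polkadot \
  --validator \
  --name "my-validator" \
  --chain kusama \
  --base-path /var/lib/polkadot
```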
We want to report the same issue on one of our Kusama validators. |
@m-saxemberg |
Might as well note that I'm seeing the same on my Kusama validators, most recently yesterday. In my case a restart helps. |
@paradox-tt can you share the information requested above? The more logs you have the better. |
Seeing this right now in our Kusama validator Turboflakes. This validator was working fine. Yesterday I rebooted the server (Ubuntu 22.04 LTS), and the service failed to start. I needed to regenerate the network key and used […]. Edit: adding the service file: `[Unit] […] [Service] […] [Install] […]` |
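For context, the usual way to write a fresh network secret to disk is the `key generate-node-key` subcommand; the exact command and path used above were not preserved, so the following is only a sketch with a typical Kusama default path:

```bash
# Generate a new libp2p network secret at the node's default Kusama location (placeholder path).
polkadot key generate-node-key \
  --file /var/lib/polkadot/chains/ksmcc3/network/secret_ed25519
```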
Getting 0 points after using […]. Now the question is, why did you have to use […]? |
Here are the logs. Thinking back to yesterday, I don't think I gracefully shut down the service before restarting, so maybe that caused the problems and this is a red herring, because the db needed to be resynced again when it came back up, too. https://pastebin.com/VZMf7Z6B (although it still seems weird that a db corruption would blow away the key too; I did not check the contents of that path before regenerating the key) |
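As a side note, the PeerId that a stored network secret corresponds to can be checked without touching it, which helps distinguish a lost key from a corrupted database. A sketch, assuming the standard `key` subcommand and the same placeholder path as above:

```bash
# Print the PeerId derived from the existing on-disk network secret (placeholder path).
polkadot key inspect-node-key \
  --file /var/lib/polkadot/chains/ksmcc3/network/secret_ed25519
```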
@alexggh this node is still not paravalidating after 72hrs+. Is there a recommended course of action? |
This has dinged us (LuckyFriday) several times – most recently today. Prepping logs to post. |
Additionally, I don't think that 6720279 will ultimately solve the issue, as most of the people affected by this issue DID NOT delete/regenerate their node secret or change their IP. It seems most likely to happen when a node re-enters from the waiting list (so it's frequently affecting 1KV validators). |
@mchaffee your Polkadot validator https://apps.turboflakes.io/?hain=polkadot#/validator/14AakQ4jAmr2ytcrhfmaiHMpj5F9cR6wK1jRrdfC3N1oTbUz seems to be getting loads of points now. Did you do anything to fix it, or was it just the restarts I see in the logs?
Anyway, I think your problem is caused by the fact that your validator seems to be reporting 73 different IP addresses; that is bound to create problems with connectivity between validators. I'm not sure how you got into this state (for context, no other validator publishes more than 5). The information is probably in the logs from before you joined the active set; I would expect you would see a lot of:
FYI @paritytech/networking, maybe you have an idea how the node might have got into this state; I think we've seen this happening on Rococo here: #3519 (comment). Do you by any chance run more than one node, such that they might connect to each other using those addresses? I think after the restart your node cleared all those accumulated IP addresses, and that's why it is working now. |
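To compare what a node reports about itself against what other validators see in the DHT, the standard `system_*` RPCs can be queried locally. A sketch, assuming the default RPC endpoint on `127.0.0.1:9944`:

```bash
# Ask the node for its local PeerId and the addresses it is currently listening on.
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"system_localPeerId","params":[]}' \
  http://127.0.0.1:9944
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":2,"method":"system_localListenAddresses","params":[]}' \
  http://127.0.0.1:9944
```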
@CertHum-Jim looked at your […]
Given your validator gets 0 backing points and 0 points for authoring relay chain blocks, I would check whether your validator is connected and in sync with the relay chain. Could you also provide all the logs that you've got, and I can try to advise based on that. |
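A quick way to check connectivity and sync status is the `system_health` RPC, which reports the peer count and whether the node is still syncing. A sketch, again assuming the default local RPC endpoint:

```bash
# Returns peers, isSyncing and shouldHavePeers for the local node.
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"system_health","params":[]}' \
  http://127.0.0.1:9944
```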
Yes, the mentioned commit won't help if addresses did not change, or if they are invalid, or if there is some other bug at play, because it only helps with propagating newer addresses faster.
Could you please direct them to this ticket with their logs? Unfortunately this issue is generic enough and happens occasionally. Anything wrong with a validator will manifest as missing points, so unless we've got logs and specific details about the setup and network conditions at the time it happens, we won't be able to properly root-cause it. |
The node is online and the chain is synced, but the peer ID that is seen in p2p is not what is on the node (you can see the correct one in telemetry and as part of 1KV, and when it changed in the pastebin logs from when I originally reported the bug). Also, in your output, the first IPv4 address and the third are correct. The second I have never seen before and is not from any provider we've ever used; it's really strange that it is seeing that on the network. |
Just a little more info: the 103.240.197.70 IP comes from an AS run by Leapswitch, and the block seems to be assigned to Sunday Networks, a HK company. Traceroutes indicate it is located in India. So I'm not really sure how the same network ID can live in both Singapore and India at the same time. |
What do you mean it is not correct? In both the DHT and your logs I see […]
Edit: I see your point, you are saying your local node ID is actually: […]
And the DHT shows us the old PeerId, the one that you lost? You are saying this one: […] |
The file is here: https://drive.google.com/file/d/1NN3lZyaPdiFu7TIeExjW_VuFeE_eiAL-/view?usp=sharing Maybe I'm just reading it wrong, but that 103. address sure does seem like it's associated with a lot of peers. |
Yes, indeed, but I think that's a red herring, because all the other nodes work. I think the main problem is why all authorities still see your old PeerId. I'll try to figure out what is happening tomorrow; until then, I guess your best bet to recover this validator is to change your authority keys by doing a key rotation, which has been found to be helpful in these kinds of cases. |
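For reference, a session-key rotation is normally done with the `author_rotateKeys` RPC against the validator's own node, and the returned public keys then have to be registered on-chain via a `session.setKeys` extrinsic. A sketch, assuming the default local RPC endpoint (unsafe RPC methods are allowed on local connections with default settings):

```bash
# Generate a fresh set of session keys in the node's keystore and print their public keys.
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"author_rotateKeys","params":[]}' \
  http://127.0.0.1:9944
```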
Ok, thanks for looking into it. Also, I hijacked this issue a bit, because I think the original report from @SimonKraus occurred without the network ID being changed on the node, so I'm just adding that to put it back on topic (although maybe the root cause is related). We did have that happen once (where no changes were made, just out of and back into the set) on our TVP Polkadot node, and we found the only remedy was to deploy a new node with a new network and authority key. But for this instance I'll just rotate and hopefully it will start working. |
Description of bug
This issue has been going on for some months now on both networks, Polkadot and Kusama.
It has been discussed a few times in Matrix, and @bLd75 dropped something related on Stack Exchange.
Problem:
Validators are proposing relay chain blocks that are accepted.
The validator does not vote on any parachain blocks (neither implicit nor explicit) and misses all votes, resulting in 0 backing points.
At the time this happens there are no weird metrics at a quick glance, and there's nothing suspicious in the logs.
Currently this affects 20 validators in the active Kusama set, according to the Turboflakes dashboard.
I've experienced this issue multiple times myself and have spoken to many fellow operators, and although some things work in some cases (like restarting the node, rotating the keys, changing key permissions), this issue is still very much unpredictable to me.
Steps to reproduce
unverifiableRandomnessFunction()