Safeguards and mitigations to preserve liveness #222
Comments
The main part of the sidecar design is to be able to use the local execution client to produce blocks. If the relay doesn't reply to mev-boost, the proposer will still be able to get a valid block, get attestations, and earn rewards. TODO: check that all the consensus clients fall back to the local execution client when they get no reply to the question: what is the expected reply time from the relay? How much time will the proposer wait for an answer? Will there be plenty of time to execute the fallback code path? question: should the local execution client build a block in parallel so it is ready in case the relay fails? |
not just fall-back in event of no reply, but I would suggest running the build process in parallel and just not "getting the local block" if mev-boost is working properly. otherwise, abort mev-boost and get the locally built block EDIT: just saw your last question. that's what I would suggest 👍 |
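A minimal sketch of the "build in parallel, fall back on failure" idea discussed above. The function names, the callables passed in, and the 1 second default are assumptions for illustration only; they are not the mev-boost or consensus client API.

```python
import asyncio

# Assumed default; the real tolerance is whatever the builder-specs define.
BUILDER_TIMEOUT_S = 1.0


async def choose_block(get_relay_block, build_local_block, timeout_s=BUILDER_TIMEOUT_S):
    """Start the local build immediately and prefer the relay block if it arrives in time.

    Both arguments are hypothetical zero-argument coroutine functions.
    Returns a (source, block) tuple so the caller signs exactly one block.
    """
    local_task = asyncio.create_task(build_local_block())
    try:
        block = await asyncio.wait_for(get_relay_block(), timeout=timeout_s)
    except Exception:
        # Timeout or any relay error: use the block that was built in parallel.
        return "local", await local_task
    # The relay answered in time, so the local block is not needed.
    local_task.cancel()
    return "relay", block
```

Because the local build starts before the relay is asked, the fallback path does not have to begin from scratch once the timeout fires.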
Because of the previous point, and since the relay is trusted, the relay team can just disconnect it while the bug is solved. For the flashbots relay we will have two devops to cover most timezones, with their respective backup humans, and alerts to notify them when things are suspicious. When things are weird, the process should start with putting the relay offline while we understand the problem. question: what are the conditions we should monitor to identify problems? question: what weird things could start to happen if there is no trusted relay available for mev extraction? question: what happens if the relay was trusted but now goes rogue or just can't or won't shut down? |
We have to inform everybody of the risks of using a relay that is not trustworthy, whether because its intentions are not clear, it is profit-maximizing above all else, it has no reliable uptime, or it is not careful about enforcing that builders provide valid and sensible blocks. We want all the people interested in running a relay to start by running a builder, so they understand all the challenges and get our support. See #145. We can have a relay monitor, so when one proposer is affected, they can share the information and warn the others. See #142. question: what makes a relay not trustworthy? question: if a relay repeatedly misbehaves, should mev-boost or the consensus client discard it and force the operator to run a command to enable it again? How often will mev-boost interact with the relay? If this is not often, then the permanent disconnection could be too slow. question: what metrics should we monitor? How do we translate this into a number that lets proposers evaluate the risk of using a specific relay? question: what happens if the relay monitor fails or goes rogue? |
The relays have to be very strongly and constantly scrutinized by the searchers, builders and proposers. A relay should get the blocks from a trusted builder or from a network of competing builders that is stable and not centralized. question: what happens if the most profitable builder is shady? Like anonymous and not trustworthy, solely profit oriented. The short term economic incentive would be to use it, then it will get a majority and will own the blockchain. 🤔 maybe the relay can rotate the builders, so none of them produces more than X% of the blocks. Not the best idea for profitability, but makes sense for long-term stability. |
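The rotation idea can be sketched as a share cap on recent blocks. This is only an illustration: `pick_builder`, its arguments, and the sliding window of recent winners are all assumptions, not something any relay is known to implement.

```python
def pick_builder(bids, recent_winners, max_share):
    """Pick the highest bidder whose share of recent blocks is below max_share.

    bids:           dict of builder name -> bid value (hypothetical input)
    recent_winners: list of builder names that won the last N blocks
    max_share:      cap as a fraction, e.g. 0.5 for "no more than 50%"
    """
    total = len(recent_winners)
    # Walk the bids from most to least profitable.
    for builder, _value in sorted(bids.items(), key=lambda kv: kv[1], reverse=True):
        share = recent_winners.count(builder) / total if total else 0.0
        if share < max_share:
            return builder
    return None  # every bidder is over the cap


# Builder "a" bids more but already produced 3 of the last 4 blocks.
winner = pick_builder({"a": 10, "b": 5}, ["a", "a", "a", "b"], max_share=0.5)
```

The cap trades some short-term profit for a bound on any single builder's share, which is the trade-off described in the comment above.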
Important to note that not all validators will be using mev-boost. But there's a strong economic incentive for them to use it, so I expect the majority will. Will the percentage of clueless or non-profit-maximizing validators be relevant for preventing collapse? How many of them would there need to be to play a relevant role in stability? |
Last January, I prepared a document outlining a set of proposed mev-boost security features that aim to address potential relay faults which can lead to liveness issues. mev-boost in its current state does not mitigate these faults. Given they have the potential to lead to the chain stalling and failing to propose new blocks, I have put together in this post my thoughts on high priority mitigation paths ahead of the upcoming merge. Please read the original document before continuing to read this post!
worst case scenario analysis
Let's look at a worst case scenario. We assume that at the merge, >90% of validators are running mev-boost and are exclusively connected to the Flashbots relay, and mev-boost is deployed in its current state. A bug in the Flashbots relay could possibly lead it to have the following faults. Each fault can be analyzed as being "cascading" and "attributable". A cascading fault means that the validator of the current slot is not aware if the fault occurred to the validator in the previous slot. A non-attributable fault means that it is not possible to prove whether the fault originated from validator or from relay misbehavior. Cascading faults are the most dangerous as they have the potential to impact chain liveness for extended periods of time. Attribution helps in mitigating cascading faults as fraud proofs can be constructed and programmatically used in a reputation system or circuit breaker, but it does not prevent the fault from occurring.
A bug or degraded performance due to DOS or other infrastructure outage in the relay causes it to propose block headers to validators, but fail to reveal the block body in time for inclusion in the chain. This is non-attributable because it is impossible for the network to differentiate if it is the validator or the relay that is causing the reveal delay.
A faulty relay simulation may cause it to send blocks that break consensus rules. This means it would reveal blocks to the network on time, but the blocks are not accepted by the attestation committees. This is an attributable fault because relays sign all the blocks they submit to validators.
A faulty relay simulation may cause it to send blocks that have valid consensus rules, but misrepresent the value of the blocks. An extreme case of this fault would lead to validators proposing empty blocks, or for the relay or block builder not to pay the feeRecipient. This fault could cause deteriorated user experience, but would not cause a consensus liveness issue for the network. This is an attributable fault because relays sign all the blocks they submit to validators. Questions:
potential mitigations
Validators need a way to identify these faults and disconnect from the offending relay programmatically. This means turning worst case scenarios into attributable non-cascading faults. Reveal Withholding appears to be the greatest threat and therefore the priority to mitigate. The following mitigations will focus on this fault, but can be used for the other faults too.
A circuit breaker would be code implemented by the consensus client which says "disconnect from mev-boost if the network has not produced a block in X number of blocks". This requires the consensus clients to be able to inspect network traffic to identify when missed slots occur. In theory this should mitigate block withholding and invalid block faults by making them non-cascading. It does not make block withholding attributable. Questions:
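A rough sketch of the circuit breaker rule quoted above, counting consecutive missed slots. The class name, the threshold parameter, and the choice to re-enable the builder path as soon as a block is seen again are assumptions made for illustration; an actual client could choose different rules.

```python
class CircuitBreaker:
    """Disable the builder path after too many consecutive missed slots."""

    def __init__(self, max_consecutive_missed: int):
        self.max_missed = max_consecutive_missed  # the "X" in the rule above
        self.consecutive_missed = 0

    def on_slot(self, block_seen: bool) -> None:
        # A seen block resets the counter; a missed slot increments it.
        self.consecutive_missed = 0 if block_seen else self.consecutive_missed + 1

    def builder_enabled(self) -> bool:
        # While the breaker is tripped, the proposer would build locally instead.
        return self.consecutive_missed < self.max_missed


breaker = CircuitBreaker(max_consecutive_missed=3)
for seen in [True, False, False, False]:
    breaker.on_slot(seen)
tripped = not breaker.builder_enabled()
```

Note that the breaker only observes missed slots. It cannot tell whether a relay or a validator caused them, which matches the point that this mitigation does not make withholding attributable.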
A relay monitor is a third party system that a validator connects to and delegates the responsibility of monitoring relay performance. If the relay monitor identifies that a relay has induced any of the three faults, it can send a message to the mev-boost instance of every validator to disconnect from this relay. The clear advantage of this approach over the circuit breaker on the consensus client is that it solves for all three fault types without limitation on the data that can be accessed. The obvious drawback is that it adds an additional trusted party to the system who can have faults and outages of its own. This additional trust can be mitigated by connecting to multiple independent relay monitors with a 1/n policy. Questions:
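One possible reading of the 1/n policy is "drop a relay as soon as any one of the n monitors flags it". The sketch below assumes that reading. The function name, the report format, and the example URLs are hypothetical.

```python
def relays_to_use(relays, monitor_reports, threshold=1):
    """Return the relays flagged by fewer than `threshold` monitors.

    relays:          list of relay URLs the validator is configured with
    monitor_reports: dict of monitor name -> set of relay URLs it flags as faulty
    threshold:       number of flags needed to drop a relay (1 means 1-of-n)
    """
    allowed = []
    for relay in relays:
        flags = sum(1 for flagged in monitor_reports.values() if relay in flagged)
        if flags < threshold:
            allowed.append(relay)
    return allowed


reports = {
    "monitor-a": {"https://relay-1.example"},
    "monitor-b": set(),
}
usable = relays_to_use(["https://relay-1.example", "https://relay-2.example"], reports)
```

With `threshold=1` a single faulty or malicious monitor can disconnect a healthy relay. That only costs MEV revenue rather than liveness, since the proposer can still build locally.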
A relay multi-sig means that mev-boost would implement some logic which requires at least x of n relays to propose the same block header for the header to be considered valid and be released to the consensus client. In theory, this should reduce the risk of faults occurring if relays are run by independent parties and have independent implementations. It does not, however, seem to help with cascading faults or attribution in the worst case scenario if multiple relays have correlated faults. Questions:
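The x/n agreement check could look roughly like the following. Headers are represented here as plain hash strings, and `agreed_header` is a hypothetical helper, not part of mev-boost.

```python
from collections import Counter


def agreed_header(headers_by_relay, x):
    """Return the header hash proposed by at least x relays, or None.

    headers_by_relay: dict of relay name -> header hash (None if no reply)
    x:                minimum number of relays that must agree
    """
    counts = Counter(h for h in headers_by_relay.values() if h is not None)
    if not counts:
        return None
    header, votes = counts.most_common(1)[0]
    return header if votes >= x else None


# Two of three relays agree, so a 2/3 policy releases the header.
released = agreed_header({"r1": "0xabc", "r2": "0xabc", "r3": "0xdef"}, x=2)
```

If no header reaches the threshold, the function returns `None` and the proposer would fall back to a locally built block.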
Fraud proofs or payment proofs involve taking an attributable fault and generating a proof that is submitted to all other validators in the network to notify them to disconnect from a relay. It can be used to turn cascading, attributable faults into non-cascading faults. This means it is not helpful for the withholding issue but can be used alongside other mitigation techniques. Questions:
worst case fault recovery testing
Whichever mitigation is selected, it should be deployed and tested in a production environment by a supermajority of mainnet node operators on diverse consensus clients. The test should simulate the worst case fault described above with 100% of node operators connected to the faulty relay, and verify that the chain is able to successfully recover and continue producing blocks. Node operators should only whitelist a relay once it has successfully completed this test. |
I'm interested in how we can design that worst case fault recovery testing. Sepolia could have a representative sample of the node operators proportional to their stake in mainnet. And then we could coordinate all the known big node operators for this kind of testing. Would that make sense? Or would this only be feasible in a lab simulated testnet? @parithosh @lightclient @yoavw, any thoughts? Anybody else from your teams that would want to collaborate on this? |
Here's one case in the category of a trusted relay going rogue that we can't shut down. What happens if the flashbots DNS is attacked and we lose control over the domain? @sukoneck can we define and implement a policy for DNS changes on the mainnet relay that prevents a single employee from changing it, that prevents any customer support from the provider from changing it, and that alerts on any changes? I've reported it in https://github.com/flashbots/infra/issues/105 mev-boost has to register the validator with the URL and the public key, and verify that every block received is signed with the corresponding private key. |
Sepolia has a permissioned validator set and while some of the staking pools are represented, I wouldn't say it's proportional to mainnet. We'd like to keep the validator set small, so onboarding a lot of validators would be out of the question. I'd say an ephemeral testnet or a shadow fork is probably the easiest way to coordinate this sort of testing. I'd be happy to help with either. I'm assuming your team is already in touch with all potential participants and I'd mainly have to provide configs and validator keys? |
there is currently a 1 second timeout for mev-boost to produce a block, after which the proposer moves to a local pathway: https://github.com/ethereum/builder-specs/blob/main/specs/validator.md#relation-to-local-block-building. my only hesitation with local building in parallel is whether the resource cost hinders those who would otherwise run nodes, e.g. at-home stakers, although we should assume any proposer is sufficiently resourced to produce a block w/o the builder network. this kind of suggests we should update the directive in the builder-specs |
Would that really be the case? My understanding is that |
Actually currently in Teku we make an async request for an I am wondering shouldn't there be a timeout in the builder spec similar to BUILDER_PROPOSAL_DELAY_TOLERANCE (1s) for getting the payload from the builders? That way worst case scenario, block wouldn't get delayed too much and if a local |
I think there is an interesting case to be made for each client to implement mitigations as they see fit rather than the entire network adopting the same mitigation technique. Diverse mitigations might mean more network resilience against accidental outages, and a higher cost for deliberate attacks. It would be great to keep a reference of the mitigations used by each client in this issue or in the mev-boost documentation (cc @0xpanoramix). Here are some links to lighthouse and prysm: |
if I'm following you, you are referring to a timeout on the call to get the complete payload from the builder after having already signed the bid. in this scenario, a proposer does not want to publish a competing block as it would be a slashable offence. I think building in parallel makes sense but the proposer should only ever make one (1) signature for a given slot |
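The "one signature per slot" constraint can be illustrated with a small in-memory guard. This is only a sketch: `ProposalGuard` is a hypothetical name, block roots are plain strings, and a real client would persist this state in its slashing protection database rather than in memory.

```python
class ProposalGuard:
    """Remembers which block root was signed for each slot."""

    def __init__(self):
        self._signed = {}  # slot -> block root already signed

    def may_sign(self, slot: int, block_root: str) -> bool:
        # The first root seen for a slot is recorded; any different root is refused.
        previous = self._signed.setdefault(slot, block_root)
        return previous == block_root


guard = ProposalGuard()
first = guard.may_sign(10, "0xaa")   # builder bid signed for slot 10
second = guard.may_sign(10, "0xbb")  # a competing local block for the same slot
```

Under this rule, once the bid for a slot is signed, a late or missing payload cannot be answered by signing a local block for that same slot.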
Yeah, I was referring to the payload call. As for the building in parallel, when the proposer asks for a block, it should be either MEV or a local one depending on timeouts or any exceptions. There will be only one signature. The additional timeout for the payload call could help potentially with mitigating any malicious delays when requesting the payload. |
I don't see how an additional timeout on the BN would help here. mev-boost tries to get the payload from all the relays, and as soon as it gets the payload from one it cancels the requests to other relays. This alone should mitigate any malicious delays from other relays. Otherwise mev-boost is using a 2 second relay timeout by default, configurable with |
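The "first payload wins, cancel the rest" behaviour described above can be sketched with asyncio. This is an illustration of the pattern, not mev-boost's actual code (mev-boost is written in Go). `first_payload` and the fetcher callables are hypothetical, and the 2 second default mirrors the relay timeout mentioned in the comment.

```python
import asyncio


async def first_payload(fetchers, timeout_s=2.0):
    """Query all relays concurrently and return the first successful payload.

    fetchers: list of hypothetical zero-argument coroutine functions, one per relay.
    Returns None if every relay fails or the overall timeout expires.
    """
    pending = {asyncio.create_task(fetch()) for fetch in fetchers}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                # A relay that raised is skipped; the others keep running.
                if task.exception() is None:
                    return task.result()
        return None
    finally:
        # Cancel the requests that are still in flight.
        for task in pending:
            task.cancel()
```

A slow or stalling relay therefore cannot delay the proposer beyond the fastest healthy relay, or beyond the overall timeout if none respond.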
Shouldn't this relationship be 1:1? My understanding is payload should come from the relay where mev-boost is called |
this feels like a thing that doesn't go into the spec |
I was referring to the Reveal Withholding aka "missing data" problem described above. If the beacon node is connected to a relay or has set a higher |
We want to document the conditions related to mev-boost and the Flashbots relay that would affect the liveness of the blockchain, to make sure that they are prevented or mitigated.
We want to test these conditions live in a testnet.