This repository has been archived by the owner on Jan 11, 2024. It is now read-only.

Topdown finality proposal and execution #310

Merged
merged 53 commits into main from integrate-topdown on Oct 25, 2023

Conversation

cryptoAtwill (Contributor)

Integrate topdown proposal and execution.

@aakoshh This is the previous approved PR with:

  • Using validator change set
  • Update configuration
  • Initial simplified version of top-down message execution. I understand there are more sophisticated methods for execution, but the current one is for integration testing, to see that execution with the contract works. Once this is OK, I will update to better methods, such as explicitly using leftover gas in the block.

@adlrocha (Contributor) left a comment

Looking good. Thanks! I am missing the gas management, but I would get this one merged first so we can start testing the end-to-end flow.

# IPC related configuration parameters
[ipc]
subnet_id = "/r0"
network_name = "dev"
Contributor

Wouldn't you need to add here the settings for top-down too?
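
For illustration, here is a rough sketch of what such a section could look like, reusing the field names discussed further down in this review (parent_endpoint, parent_registry, exponential_retry_limit); the [ipc.topdown] table name, the gateway key and the example values are assumptions, not the final schema:

# Hypothetical top-down settings, following the suggestions in this review
[ipc.topdown]
# Endpoint of the parent subnet node
parent_endpoint = "http://localhost:8545"
# Addresses of the IPC registry and gateway contracts deployed on the parent
parent_registry = "0x..."
parent_gateway = "0x..."
# Max number of retries for exponential backoff before giving up
exponential_retry_limit = 5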

Contributor

What is network_name? IIRC @dnkolegov said it's the same as the subnet ID.
Sorry I can't keep it in my head 😞

/// The max number of retries for exponential backoff before giving up
pub exponential_retry_limit: usize,
/// The ipc provider url
pub ipc_provider_url: String,
Contributor

This should be the endpoint for the parent, right? Not for the whole ipc provider. This is confusing.

Suggested change
pub ipc_provider_url: String,
pub parent_endpoint: String,

pub ipc_provider_url: String,
/// The parent registry address
#[serde(deserialize_with = "deserialize_eth_address_from_str")]
pub registry_address: Address,
Contributor

Maybe it is a good idea to make it more explicit?

Suggested change
pub registry_address: Address,
pub parent_registry: Address,

Contributor

The gateway and the registry do not have stable addresses on Lotus, right?

Contributor

Correct, that is why we need to offer a config for that (for L2+ subnets this will be easier, as we can point to the deterministic address deployed by Fendermint).

fendermint/app/settings/src/lib.rs (resolved)
fendermint/app/src/app.rs (resolved)
fendermint/app/src/cmd/run.rs (resolved)
.deliver(state, VerifiableMessage::NotVerify(msg))
.await?;
if ret.is_err() {
// TODO: how to handle error here? Do we return Err or stash them into one ret?
Contributor

I think in this case we should just log the error and return without doing anything else, as the finality commitment failed.

Contributor

I agree, just return immediately, nothing we can do here.

Contributor

Ah, wait, we are using the terms too loosely here. What do we mean by "return" exactly?

My initial thought was that we can just return Err(anyhow!("failed to apply finality")) and by doing that crash Fendermint, because if a legitimately voted-upon finality fails to be added to the contract, we have a serious bug in the system.

But you might argue it's better to keep chugging along, effectively severed from the parent subnet, in which case we can return Ok(ret) and go into terra incognita, see if we end up with a consensus failure or live to fight another day.

Whichever it is, I don't think we should continue with the execution of messages.

@aakoshh (Contributor) Oct 10, 2023

In similar situations I chose to crash the application:

Basically, if I think something should not fail and it does, I'd rather stop than limp along with potentially invalid state.
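
A minimal sketch of that option, assuming the ret and msg names from the quoted snippet, an interpreter receiver, and anyhow as the error type (none of which is guaranteed to match the final code):

// Sketch only: if a voted-upon finality fails to be committed, stop here
// rather than continue with potentially invalid state.
let ret = interpreter
    .deliver(state, VerifiableMessage::NotVerify(msg))
    .await?;
if let Err(e) = ret {
    anyhow::bail!("failed to commit parent finality: {e}");
}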

.deliver(state, VerifiableMessage::NotVerify(msg))
.await?;
if ret.is_err() {
// TODO: how to handle error here? Do we return Err or stash them into one ret?
Contributor

Same here; if this fails I would just abort and try again when the next finality is committed.

Contributor

What could be a legitimate reason for this to fail?

So again we could crash, saying this must be a bug: these changes have already been added to the parent ledger, so how can we fail to add them on the child? I'm not sure how we could fix the contract if there was a bug, but maybe the bug is in the syncer?

Or we can choose to take it on the chin and return Ok(ret), however the problem is that we have already committed the finality itself, so the syncer will think it doesn't have to re-fetch these changes, and then we have two possibilities:

  • We have a bug in the syncer of this node instance, and everyone else applied the changes, in which case it's a consensus failure
  • Nobody applied these changes, in which case it creates a gap in the ledger that might not exist on the parent, which can lead to unaccepted checkpoints

Quite possibly the best thing would be to apply all 3 changes as one atomic call to the contracts. I don't know of a way to selectively roll back the first transaction at this point.
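
As a hypothetical sketch of that grouping (none of these names exist in the PR), the three steps could be driven as one unit that fails together; actual atomicity would still have to come from discarding the buffered state changes, or from a single contract call, when the unit returns an error:

use anyhow::Result;
use std::future::Future;

// Hypothetical: run the three parts of top-down finality handling as one
// unit, so a failure in any step fails the whole unit and it can be retried
// in a later block. The caller is expected to discard any state changes when
// this returns Err, which is what makes the unit behave atomically.
async fn execute_topdown_finality<F1, F2, F3>(
    commit_finality: F1,         // 1. commit the parent finality itself
    apply_validator_changes: F2, // 2. apply the validator change set
    execute_messages: F3,        // 3. execute the top-down messages
) -> Result<()>
where
    F1: Future<Output = Result<()>>,
    F2: Future<Output = Result<()>>,
    F3: Future<Output = Result<()>>,
{
    commit_finality.await?;
    apply_validator_changes.await?;
    execute_messages.await?;
    Ok(())
}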

Contributor

I would choose to crash for the same reason I mentioned here.

.deliver(state, VerifiableMessage::NotVerify(msg))
.await?;
if ret.is_err() {
// TODO: how to handle error here? Do we return Err or stash them into one ret?
Contributor

This one is a bit messier, it really depends on the type of error:

  • We should have something in the contract so that if the message execution reverts, we still increase the nonce and make the user pay for the work done.
  • Then there are other errors from apply_message that revert before the nonce can even be increased; those should be handled with care, and we should manually increase the nonce in the contract state if we don't want the execution to be stuck.

Contributor

Once again, crashing might be our best option to avoid invalid state.

I agree that the messages in the batch should be executed separately and their result should be emitted as events regardless of the outcome - at least I can't think of a better way of surfacing them.

But the overall execution itself should not fail. If there is a reason it fails, we should fix it and resume from there, rather than sometimes skip entire batches because of a rotten apple or who knows what causes it to fail.

Contributor

For example, say it fails because even implicit executions respect the gas limit, which we set at BLOCK_GAS_LIMIT. We talked about this when I proposed that messages should be appended to a queue rather than immediately executed, and then executed as and when there is unused gas in the block, but at this point it's just one big bang and it can fail.

What then, do we just say oh well, sometimes depending on how many messages there are we skip entire batches? Or should we say we quickly patch the application to try to break it up into multiple batches?

Of course crashing all nodes probably means the chain outage is going to be unacceptable, which leads me back to my point that the best thing to do would be to treat the entire top-down execution as an atomic unit and mark it as failed, so that it can be retried again and again until it's fixed, but without crashing Fendermint or halting all other applications.

Sorry, I know I am going back and forth between crashing and marking as failed, but in this case there is a legitimate reason for failure, which is this overall gas limit. What we should not do is partial failure, IMO.

Contributor

Another relatively easy-to-implement option to prevent running out of gas while handling the batches is https://github.com/consensus-shipyard/fendermint/issues/261

fendermint/vm/interpreter/src/chain.rs (resolved)

#[serde_as]
#[derive(Debug, Deserialize, Clone)]
pub struct IPCSettings {
Contributor

Suggested change
pub struct IPCSettings {
pub struct IpcSettings {

So that we match the existing capitalization here and [here](https://github.com/consensus-shipyard/fendermint/blob/main/fendermint/vm/genesis/src/lib.rs#L152) for example.

let url = topdown_config
.ipc_provider_url
.parse()
.context("invalid agent URL")?;
Contributor

So, no more agent?

It would be nice to move this parsing into the settings and use a Url type instead of String, to catch problems early. There is a type in tendermint_rpc we already use here, perhaps it can do the trick.
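
For example, something along these lines might work, assuming tendermint_rpc::Url can be deserialized directly and reusing the parent_endpoint name suggested above (the struct is illustrative, not the actual settings type):

use serde::Deserialize;
use tendermint_rpc::Url;

// Illustrative settings fragment: parsing the URL when the config is loaded
// means a malformed endpoint is rejected early rather than at runtime.
#[derive(Debug, Deserialize, Clone)]
pub struct TopDownSettings {
    /// Endpoint of the parent subnet node.
    pub parent_endpoint: Url,
}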

fendermint/app/src/ipc.rs (resolved)
version: 0,
from: system::SYSTEM_ACTOR_ADDR,
to: ipc::GATEWAY_ACTOR_ADDR,
value: TokenAmount::zero(),
Contributor

@adlrocha can you walk me through what happens when somebody wants to fund their account by locking up tokens in the parent subnet and sending them to the child?

I'd expect that somewhere we need to parse messages and recognise which ones are top-down messages (to=gateway, method=apply...), parse the cross messages, find the fund types, sum up their values and add them to the circ_supply.

A similar thing should happen with bottom-up messages: release should lower the circ_supply.

These are missing at the moment.

Nevertheless, the message goes to the Gateway, which is supposed to send the amount to the recipient Ethereum account. But can the Gateway mint tokens, or can it only send what it has? Should perhaps this value here be used as a minting mechanism, so that incoming messages increase the balance of the Gateway itself, in order for it to be able to credit it to the funded accounts? Can the system account mint tokens?
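
To make the circ_supply bookkeeping described above concrete, a hypothetical sketch (the message type and its fields are illustrative stand-ins, not the actual IPC types):

use fvm_shared::econ::TokenAmount;

/// Illustrative stand-in for a parsed cross-net message (not the real type).
struct ParsedCrossMsg {
    value: TokenAmount,
    kind: CrossMsgKind,
}

enum CrossMsgKind {
    Fund,    // top-down: tokens locked on the parent, credited on the child
    Release, // bottom-up: tokens sent back to the parent
}

// Adjust the child subnet's circulating supply for a batch of cross messages.
fn adjust_circ_supply(circ_supply: &mut TokenAmount, msgs: &[ParsedCrossMsg]) {
    for msg in msgs {
        match msg.kind {
            CrossMsgKind::Fund => *circ_supply = circ_supply.clone() + msg.value.clone(),
            CrossMsgKind::Release => *circ_supply = circ_supply.clone() - msg.value.clone(),
        }
    }
}

The release side would in practice be driven by the bottom-up checkpointing path; the sketch only shows the sign of each adjustment.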

Contributor

  1. When a user wants to fund an account in a subnet, they call fund in the gateway of the parent
  2. This function commits the top-down message and stores it for propagation in the top-down queue, updating the circulating supply, nonces, etc.

Fund is always top-down and release is always bottom-up, but when an arbitrary message arrives at the actor, commitCrossMessage is responsible for routing it as top-down or bottom-up.

But can the Gateway mint tokens, or can it only send what it has? Should perhaps this value here be used as a minting mechanism, so that incoming messages increase the balance of the Gateway itself, in order for it to be able to credit it to the funded accounts? Can the system account mint tokens?

This is a great question. What we did in Lotus is pre-allocate the maximum circulating supply that we want to allow in the subnet to the gateway, so the circulating supply is pre-minted and made available to the gateway to mint and burn when needed (actually, when we say burn, what we should actually do is send the balance back to the gateway to avoid reducing the circulating supply).

That being said, now that we don't have the limitations from Lotus, I agree that it would be best to have ad-hoc minting of new funds here. It is a more elegant solution, and it prevents all of the pre-allocation and burning dance that we currently have.

@aakoshh (Contributor) left a comment

For posterity: I see we are crashing if any of the 3 parts of the execution of top-down finality goes wrong.

I think that's not a bad way to go in the next two weeks, but I maintain that we should rethink our approach to execution to respect gas limits:

  • by not voting on stuff over the limit (Topdown finality proposal and execution #310 (comment))
  • by deferring execution to later blocks with free gas capacity
  • by putting all three aspects of finality into a single call that fails together, so we can retry it in a later block if it failed, rather than crash

@adlrocha (Contributor)

@cryptoAtwill, I addressed @aakoshh's comments (remember to update your local Fendermint config when you test again), but I see this PR depends on this one. @aakoshh, do you mind checking that one too, to see if we can merge it? Thanks 🙏

cryptoAtwill and others added 2 commits October 25, 2023 23:01
* fix lotus querying

* query range

* fix decoding

* Revert typechain

* log of parsing

* fix order

* more logs

* more logs

* more logs

* disable change set

* debug

* more code

* more code

* more code

* more logs

* mint to gateway

* debug code

* working version with clean up

* update comments

* remove testing code

* fix lint

* adding topdown module

* remove unused code

* Test staking: Part 2 (#319)

* TEST: Use EthAddress

* TEST: Update closure

* TEST: Generate join and leave

* TEST: Separate configuration number for current and next

* TEST: Take &mut self in StakingState

* TEST: Test join

* TEST: Test collateral and balance

* TEST: Test stake command; fix check of has_joined

* TEST: Test leave command; fix total collateral

* TEST: Test checkpoint command; signature fails

* TEST: Build checkpoint during command run

* TEST: Higher default random size

* TEST: Unit test for ABI encoding

* TEST: Hash checkpoint as tuple

* TEST: Try sending the signature with +27 recovery ID

* TEST: Use non-masked ethereum address

* TEST: Do not increment config when staking 0

* TEST: Tweak token amount generation so there aren't that many zeroes

* TEST: Choose min_collateral so the last joiner activates

* TEST: Bootstrap tests

* TEST: Claim

* TEST: Show error data in genesis

* TEST: min_collateral > 0

* TEST: use local actors

* TEST: Ranking

* TEST: Subnet deactivation

* TEST: Trying to debug the active validator set

* TEST: See doesn't need bytes

* TEST: Do not fail test if minimum collateral disagrees

* TEST: Maybe we should fail

* TEST: Assert active limit

* TEST: Fix repeatability

* TEST: Example of going over the limit of 2 active validators

* TEST: Debugged the over-the-limit validators

* TEST: Fix the active collateral sum

* TEST: Quit testing if we hit the situation of differently ordered minimum collaterals

* TEST: Check the hash of cross messages

* TEST: Point at the integration testing branch of ABIs

* TEST: Fix clippy

* TEST: Update ABI git reference

* FIX: Only take the /out

* FIX: Install openssl in docker

* TEST: Update Rust version in docker

* FIX: Remove debug

* FIX: 27 shift only needed by Solidity

* TEST: Add unstake

* FIX: SMT unit tests

* FIX: Update RocksDB code

* TEST: Use IPC actors dev

* Update fendermint/vm/interpreter/src/fvm/state/ipc.rs

Co-authored-by: Akosh Farkash <[email protected]>

* update cargo

* format code

* log parent sync messages

* sort staking change requests

* sort in ascending order

* fmt code

---------

Co-authored-by: Akosh Farkash <[email protected]>
@cryptoAtwill cryptoAtwill merged commit 69295d6 into main Oct 25, 2023
2 checks passed
@cryptoAtwill cryptoAtwill deleted the integrate-topdown branch October 25, 2023 15:06