This repository has been archived by the owner on Jan 11, 2024. It is now read-only.

Topdown finality proposal and execution #310

Merged
merged 53 commits into main from integrate-topdown on Oct 25, 2023

Conversation

cryptoAtwill (Contributor)

Integrate topdown proposal and execution.

@aakoshh This is the previous approved PR with:

  • Using validator change set
  • Update configuration
  • Initial simplified version of top-down message execution. I understand there are more sophisticated methods for execution, but the current one is for integration testing, to see that execution with the contract works. Once this is OK, I will update to better methods, such as explicitly using leftover gas in the block.

@adlrocha (Contributor) left a comment

Looking good. Thanks! I am missing the gas management, but I would get this one merged first so we can start testing the end-to-end flow.

# IPC related configuration parameters
[ipc]
subnet_id = "/r0"
network_name = "dev"
Contributor

Wouldn't you need to add here the settings for top-down too?
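
For illustration, here is a rough sketch of what such a section could look like, reusing the field names discussed further down in this review (parent_endpoint, parent_registry, exponential_retry_limit); the [ipc.topdown] table name, the gateway key and the example values are assumptions, not the final schema:

# Hypothetical top-down settings, following the suggestions in this review
[ipc.topdown]
# Endpoint of the parent subnet node
parent_endpoint = "http://localhost:8545"
# Addresses of the IPC registry and gateway contracts deployed on the parent
parent_registry = "0x..."
parent_gateway = "0x..."
# Max number of retries for exponential backoff before giving up
exponential_retry_limit = 5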

Contributor

What is network_name? IIRC @dnkolegov said it's the same as the subnet ID.
Sorry I can't keep it in my head 😞

/// The max number of retries for exponential backoff before giving up
pub exponential_retry_limit: usize,
/// The ipc provider url
pub ipc_provider_url: String,
Contributor

This should be the endpoint for the parent, right? Not for the whole ipc provider. This is confusing.

Suggested change
pub ipc_provider_url: String,
pub parent_endpoint: String,

pub ipc_provider_url: String,
/// The parent registry address
#[serde(deserialize_with = "deserialize_eth_address_from_str")]
pub registry_address: Address,
Contributor

Maybe it is a good idea to make it more explicit?

Suggested change
pub registry_address: Address,
pub parent_registry: Address,

Contributor

The gateway and the registry do not have stable addresses on Lotus, right?

Contributor

Correct, that is why we need to offer a config for that (for L2+ subnets this will be easier, as we can point to the deterministic address deployed by Fendermint).

fendermint/app/settings/src/lib.rs (resolved)
fendermint/app/src/app.rs (resolved)
fendermint/app/src/cmd/run.rs (resolved)
.deliver(state, VerifiableMessage::NotVerify(msg))
.await?;
if ret.is_err() {
// TODO: how to handle error here? Do we return Err or stash them into one ret?
Contributor

I think in this case we should just log the error and return without doing anything else, as the finality commitment failed.

Contributor

I agree, just return immediately, nothing we can do here.

Contributor

Ah, wait, we are using the terms too loosely here. What do we mean by "return" exactly?

My initial thought was that we can just return Err(anyhow!("failed to apply finality")) and by doing that crash Fendermint, because if a legitimately voted-upon finality fails to be added to the contract, we have a serious bug in the system.

But you might argue it's better to keep chugging along, effectively severed from the parent subnet, in which case we can return Ok(ret) and go into terra incognita, see if we end up with a consensus failure or live to fight another day.

Whichever it is, I don't think we should continue with the execution of messages.

@aakoshh (Contributor) Oct 10, 2023

In similar situations I chose to crash the application:

Basically, if I think something should not fail and it does, I'd rather stop than limp along with potentially invalid state.
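
A minimal sketch of that option, assuming the ret and msg names from the quoted snippet, an interpreter receiver, and anyhow as the error type (none of which is guaranteed to match the final code):

// Sketch only: if a voted-upon finality fails to be committed, stop here
// rather than continue with potentially invalid state.
let ret = interpreter
    .deliver(state, VerifiableMessage::NotVerify(msg))
    .await?;
if let Err(e) = ret {
    anyhow::bail!("failed to commit parent finality: {e}");
}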

.deliver(state, VerifiableMessage::NotVerify(msg))
.await?;
if ret.is_err() {
// TODO: how to handle error here? Do we return Err or stash them into one ret?
Contributor

Same here; if this fails I would just abort and try again when the next finality is committed.

Contributor

What could be a legitimate reason for this to fail?

So again we could crash, saying this must be a bug: these changes have already been added to the parent ledger, so how can we fail to add them on the child? I'm not sure how we could fix the contract if there was a bug, but maybe the bug is in the syncer?

Or we can choose to take it on the chin and return Ok(ret), however the problem is that we have already committed the finality itself, so the syncer will think it doesn't have to re-fetch these changes, and then we have two possibilities:

  • We have a bug in the syncer of this node instance, and everyone else applied the changes, in which case it's a consensus failure
  • Nobody applied these changes, in which case it creates a gap in the ledger that might not exist on the parent, which can lead to unaccepted checkpoints

Quite possibly the best thing would be to apply all 3 changes as one atomic call to the contracts. I don't know of a way to selectively roll back the first transaction at this point.
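
As a hypothetical sketch of that grouping (none of these names exist in the PR), the three steps could be driven as one unit that fails together; actual atomicity would still have to come from discarding the buffered state changes, or from a single contract call, when the unit returns an error:

use anyhow::Result;
use std::future::Future;

// Hypothetical: run the three parts of top-down finality handling as one
// unit, so a failure in any step fails the whole unit and it can be retried
// in a later block. The caller is expected to discard any state changes when
// this returns Err, which is what makes the unit behave atomically.
async fn execute_topdown_finality<F1, F2, F3>(
    commit_finality: F1,         // 1. commit the parent finality itself
    apply_validator_changes: F2, // 2. apply the validator change set
    execute_messages: F3,        // 3. execute the top-down messages
) -> Result<()>
where
    F1: Future<Output = Result<()>>,
    F2: Future<Output = Result<()>>,
    F3: Future<Output = Result<()>>,
{
    commit_finality.await?;
    apply_validator_changes.await?;
    execute_messages.await?;
    Ok(())
}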

Contributor

I would choose to crash for the same reason I mentioned here.

.deliver(state, VerifiableMessage::NotVerify(msg))
.await?;
if ret.is_err() {
// TODO: how to handle error here? Do we return Err or stash them into one ret?
Contributor

This one is a bit messier, it really depends on the type of error:

  • We should have something in the contract so that if the message execution reverts, we still increase the nonce and make the user pay for the work done.
  • Then there are other errors from apply_message that revert before the nonce can even be increased; those should be handled with care, and we should manually increase the nonce in the contract state if we don't want the execution to be stuck.

Contributor

Once again, crashing might be our best option to avoid invalid state.

I agree that the messages in the batch should be executed separately and their result should be emitted as events regardless of the outcome - at least I can't think of a better way of surfacing them.

But the overall execution itself should not fail. If there is a reason it fails, we should fix it and resume from there, rather than sometimes skip entire batches because of a rotten apple or who knows what causes it to fail.

Contributor

For example, say it fails because even implicit executions respect the gas limit, which we set at BLOCK_GAS_LIMIT. We talked about this when I proposed that messages should be appended to a queue rather than immediately executed, and then executed as and when there is unused gas in the block, but at this point it's just one big bang and it can fail.

What then, do we just say oh well, sometimes depending on how many messages there are we skip entire batches? Or should we say we quickly patch the application to try to break it up into multiple batches?

Of course crashing all nodes probably means the chain outage is going to be unacceptable, which leads me back to my point that the best thing to do would be to treat the entire top-down execution as an atomic unit and mark it as failed, so that it can be retried again and again until it's fixed, but without crashing Fendermint or halting all other applications.

Sorry, I know I am going back and forth between crashing and marking as failed, but in this case there is a legitimate reason for failure, which is this overall gas limit. What we should not do is partial failure, IMO.

Contributor

Another relatively easy-to-implement option to prevent running out of gas while handling the batches is https://github.com/consensus-shipyard/fendermint/issues/261

fendermint/vm/interpreter/src/chain.rs (resolved)

#[serde_as]
#[derive(Debug, Deserialize, Clone)]
pub struct IPCSettings {
Contributor

Suggested change
pub struct IPCSettings {
pub struct IpcSettings {

So that we match the existing capitalization here and [here](https://github.com/consensus-shipyard/fendermint/blob/main/fendermint/vm/genesis/src/lib.rs#L152) for example.

let url = topdown_config
.ipc_provider_url
.parse()
.context("invalid agent URL")?;
Contributor

So, no more agent?

It would be nice to move this parsing into the settings and use a Url type instead of String, to catch problems early. There is a type in tendermint_rpc we already use here, perhaps it can do the trick.
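
For example, something along these lines might work, assuming tendermint_rpc::Url can be deserialized directly and reusing the parent_endpoint name suggested above (the struct is illustrative, not the actual settings type):

use serde::Deserialize;
use tendermint_rpc::Url;

// Illustrative settings fragment: parsing the URL when the config is loaded
// means a malformed endpoint is rejected early rather than at runtime.
#[derive(Debug, Deserialize, Clone)]
pub struct TopDownSettings {
    /// Endpoint of the parent subnet node.
    pub parent_endpoint: Url,
}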

fendermint/app/src/ipc.rs (resolved)
version: 0,
from: system::SYSTEM_ACTOR_ADDR,
to: ipc::GATEWAY_ACTOR_ADDR,
value: TokenAmount::zero(),
Contributor

@adlrocha can you walk me through what happens when somebody wants to fund their account by locking up tokens in the parent subnet and sending them to the child?

I'd expect that somewhere we need to parse messages and recognise which ones are top-down messages (to=gateway, method=apply...), parse the cross messages, find the fund types, sum up their values and add them to the circ_supply.

A similar thing should happen with bottom-up messages: release should lower the circ_supply.

These are missing at the moment.

Nevertheless, the message goes to the Gateway, which is supposed to send the amount to the recipient Ethereum account. But can the Gateway mint tokens, or can it only send what it has? Should perhaps this value here be used as a minting mechanism, so that incoming messages increase the balance of the Gateway itself, in order for it to be able to credit it to the funded accounts? Can the system account mint tokens?
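
To make the circ_supply bookkeeping described above concrete, a hypothetical sketch (the message type and its fields are illustrative stand-ins, not the actual IPC types):

use fvm_shared::econ::TokenAmount;

/// Illustrative stand-in for a parsed cross-net message (not the real type).
struct ParsedCrossMsg {
    value: TokenAmount,
    kind: CrossMsgKind,
}

enum CrossMsgKind {
    Fund,    // top-down: tokens locked on the parent, credited on the child
    Release, // bottom-up: tokens sent back to the parent
}

// Adjust the child subnet's circulating supply for a batch of cross messages.
fn adjust_circ_supply(circ_supply: &mut TokenAmount, msgs: &[ParsedCrossMsg]) {
    for msg in msgs {
        match msg.kind {
            CrossMsgKind::Fund => *circ_supply = circ_supply.clone() + msg.value.clone(),
            CrossMsgKind::Release => *circ_supply = circ_supply.clone() - msg.value.clone(),
        }
    }
}

The release side would in practice be driven by the bottom-up checkpointing path; the sketch only shows the sign of each adjustment.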

Contributor

  1. When a user wants to fund an account in a subnet, they call fund in the gateway of the parent
  2. This function commits the top-down message and stores it for propagation in the top-down queue, updating the circulating supply, nonces, etc.

Fund is always top-down and release is always bottom-up, but when an arbitrary message arrives at the actor, commitCrossMessage is responsible for routing it as top-down or bottom-up.

But can the Gateway mint tokens, or can it only send what it has? Should perhaps this value here be used as a minting mechanism, so that incoming messages increase the balance of the Gateway itself, in order for it to be able to credit it to the funded accounts? Can the system account mint tokens?

This is a great question. What we did in Lotus is pre-allocate the maximum circulating supply that we want to allow in the subnet to the gateway, so the circulating supply is pre-minted and made available to the gateway to mint and burn when needed (actually, when we say burn, what we should actually do is send the balance back to the gateway to avoid reducing the circulating supply).

That being said, now that we don't have the limitations from Lotus, I agree that it would be best to have ad-hoc minting of new funds here. It is a more elegant solution, and it prevents all of the pre-allocation and burning dance that we currently have.

@aakoshh (Contributor) left a comment

For posterity: I see we are crashing if any of the 3 parts of the execution of top-down finality goes wrong.

I think that's not a bad way to go in the next two weeks, but I maintain that we should rethink our approach to execution to respect gas limits:

  • by not voting on stuff over the limit (Topdown finality proposal and execution #310 (comment))
  • by deferring execution to later blocks with free gas capacity
  • by putting all three aspects of finality into a single call that fails together, so we can retry it in a later block if it failed, rather than crash

@adlrocha (Contributor)

@cryptoAtwill, I addressed @aakoshh's comments (remember to update your local Fendermint config when you test again), but I see this PR depends on this one. @aakoshh, do you mind checking that one too, to see if we can merge it? Thanks 🙏

cryptoAtwill and others added 2 commits October 25, 2023 23:01
* fix lotus querying

* query range

* fix decoding

* Revert typechain

* log of parsing

* fix order

* more logs

* more logs

* more logs

* disable change set

* debug

* more code

* more code

* more code

* more logs

* mint to gateway

* debug code

* working version with clean up

* update comments

* remove testing code

* fix lint

* adding topdown module

* remove unused code

* Test staking: Part 2 (#319)

* TEST: Use EthAddress

* TEST: Update closure

* TEST: Generate join and leave

* TEST: Separate configuration number for current and next

* TEST: Take &mut self in StakingState

* TEST: Test join

* TEST: Test collateral and balance

* TEST: Test stake command; fix check of has_joined

* TEST: Test leave command; fix total collateral

* TEST: Test checkpoint command; signature fails

* TEST: Build checkpoint during command run

* TEST: Higher default random size

* TEST: Unit test for ABI encoding

* TEST: Hash checkpoint as tuple

* TEST: Try sending the signature with +27 recovery ID

* TEST: Use non-masked ethereum address

* TEST: Do not increment config when staking 0

* TEST: Tweak token amount generation so there aren't that many zeroes

* TEST: Choose min_collateral so the last joiner activates

* TEST: Bootstrap tests

* TEST: Claim

* TEST: Show error data in genesis

* TEST: min_collateral > 0

* TEST: use local actors

* TEST: Ranking

* TEST: Subnet deactivation

* TEST: Trying to debug the active validator set

* TEST: See doesn't need bytes

* TEST: Do not fail test if minimum collateral disagrees

* TEST: Maybe we should fail

* TEST: Assert active limit

* TEST: Fix repeatability

* TEST: Example of going over the limit of 2 active validators

* TEST: Debugged the over-the-limit validators

* TEST: Fix the active collateral sum

* TEST: Quit testing if we hit the situation of differently ordered minimum collaterals

* TEST: Check the hash of cross messages

* TEST: Point at the integration testing branch of ABIs

* TEST: Fix clippy

* TEST: Update ABI git reference

* FIX: Only take the /out

* FIX: Install openssl in docker

* TEST: Update Rust version in docker

* FIX: Remove debug

* FIX: 27 shift only needed by Solidity

* TEST: Add unstake

* FIX: SMT unit tests

* FIX: Update RocksDB code

* TEST: Use IPC actors dev

* Update fendermint/vm/interpreter/src/fvm/state/ipc.rs

Co-authored-by: Akosh Farkash <[email protected]>

* update cargo

* format code

* log parent sync messages

* sort staking change requests

* sort in ascending order

* fmt code

---------

Co-authored-by: Akosh Farkash <[email protected]>
@cryptoAtwill cryptoAtwill merged commit 69295d6 into main Oct 25, 2023
2 checks passed
@cryptoAtwill cryptoAtwill deleted the integrate-topdown branch October 25, 2023 15:06