
GRPC or transaction errors will lead to validator slashing #357

Open
andrey-kuprianov opened this issue Oct 13, 2021 · 0 comments
Surfaced from @informalsystems audit of Althea Gravity Bridge at commit 19a4cfe

severity: High
type: Implementation bug
difficulty: Easy

Involved artifacts

Description

The orchestrator's function eth_signer_main_loop(), which is responsible for signing validator set updates, batches, and arbitrary logic calls going from Cosmos to Ethereum, follows the same pattern when processing each of them, demonstrated here on the example of validator set updates:

// sign the last unsigned valsets
match get_oldest_unsigned_valsets(
    &mut grpc_client,
    our_cosmos_address,
    contact.get_prefix(),
)
.await
{
    Ok(valsets) => {
        if valsets.is_empty() {
            trace!("No validator sets to sign, node is caught up!")
        } else {
            info!(
                "Sending {} valset confirms starting with {}",
                valsets.len(),
                valsets[0].nonce
            );
            let res = send_valset_confirms(
                &contact,
                ethereum_key,
                fee.clone(),
                valsets,
                cosmos_key,
                gravity_id.clone(),
            )
            .await;
            trace!("Valset confirm result is {:?}", res);
            check_for_fee_error(res, &fee);
        }
    }
    Err(e) => trace!(
        "Failed to get unsigned valsets, check your Cosmos gRPC {:?}",
        e
    ),
}

There are multiple problems with the above code. First, if get_oldest_unsigned_valsets() returns an error, that error is effectively ignored: it is logged only at trace, the lowest possible level. The error will be ignored on every iteration of the loop, without limit.

Second, if get_oldest_unsigned_valsets() doesn't return an error, the code proceeds to call send_valset_confirms() in the following way:

let res = send_valset_confirms(...).await;
trace!("Valset confirm result is {:?}", res);
check_for_fee_error(res, &fee);

The function send_valset_confirms() involves submitting a Cosmos transaction, which may fail. As can be seen, the result of executing this function is assigned to the variable res, which is then checked for errors by the function check_for_fee_error(), which has the following structure:

fn check_for_fee_error(res: Result<TxResponse, CosmosGrpcError>, fee: &Coin) {
    if let Err(CosmosGrpcError::InsufficientFees { fee_info }) = res {
        match fee_info {
            FeeInfo::InsufficientFees { min_fees } => {
                error!("Your specified fee value {} is too small please use at least {}", ...);
                error!("Correct fee argument immediately! You will be slashed within a few hours if you fail to do so");
                exit(1);
            }
            FeeInfo::InsufficientGas { .. } => {
                panic!("Hardcoded gas amounts insufficient!");
            }
        }
    }
}

The problem is that check_for_fee_error() indeed checks only for fee errors; any other error that res may contain is silently ignored, again surfacing only in the trace-level log line shown above.
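To illustrate the fix, here is a sketch of a stricter result check. The types below (FeeInfo, CosmosGrpcError, TxResponse) are simplified stand-ins, not the actual gravity_utils definitions, and TxOutcome is a hypothetical addition: returning a classification instead of calling exit(1)/panic! directly keeps the policy in one place and makes it testable. The key change is that the final Err arm surfaces every non-fee error instead of letting it fall through.

```rust
// Simplified stand-in types (NOT the real Gravity Bridge definitions).
#[derive(Debug)]
enum FeeInfo {
    InsufficientFees { min_fees: u64 },
    InsufficientGas { required: u64 },
}

#[derive(Debug)]
enum CosmosGrpcError {
    InsufficientFees { fee_info: FeeInfo },
    TransactionFailed { reason: String },
    ConnectionError { reason: String },
}

#[derive(Debug)]
struct TxResponse;

/// What the caller should do with the result; returned instead of
/// exiting/panicking inline so the decision logic is centralized.
#[derive(Debug, PartialEq)]
enum TxOutcome {
    Ok,
    FatalFeeError,  // operator must fix the fee argument: exit
    FatalGasError,  // hardcoded gas amounts insufficient: exit
    RetryableError, // log at error level, count towards a downtime alarm
}

fn check_tx_result(res: Result<TxResponse, CosmosGrpcError>) -> TxOutcome {
    match res {
        Ok(_) => TxOutcome::Ok,
        Err(CosmosGrpcError::InsufficientFees { fee_info }) => match fee_info {
            FeeInfo::InsufficientFees { min_fees } => {
                eprintln!("fee too small, need at least {}; correct it or risk slashing", min_fees);
                TxOutcome::FatalFeeError
            }
            FeeInfo::InsufficientGas { required } => {
                eprintln!("hardcoded gas amounts insufficient, need {}", required);
                TxOutcome::FatalGasError
            }
        },
        // Previously these fell through silently; now they are surfaced.
        Err(e) => {
            eprintln!("transaction failed: {:?}", e);
            TxOutcome::RetryableError
        }
    }
}
```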

Problem Scenarios

  1. Errors may occur when communicating with Cosmos via gRPC; these errors are ignored and logged only at the lowest (trace) level.
  2. Errors may occur when submitting a Cosmos transaction; again these errors are ignored and logged only at the lowest (trace) level.

The validator will get slashed within a few hours for either of the above external errors.

Recommendation

Short term

Rework the above problematic code:

  • Log errors at the highest possible error level;
  • Do not ignore errors returned as function results: all error cases should be handled.
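As a minimal sketch of the short-term fix (with stand-in types; `error!` from the log crate is replaced by eprintln! here only to keep the example self-contained), the Err arm of the fetch could log loudly and feed a failure counter the caller can act on, instead of being swallowed at trace level:

```rust
// Simplified stand-in for the valset type (NOT the real definition).
#[derive(Debug)]
struct Valset {
    nonce: u64,
}

/// Returns the valsets to sign, or None after logging the failure loudly.
/// `consecutive_failures` lets the caller notice a persistent outage
/// instead of silently retrying forever.
fn unwrap_fetch_result(
    res: Result<Vec<Valset>, String>,
    consecutive_failures: &mut u32,
) -> Option<Vec<Valset>> {
    match res {
        Ok(valsets) => {
            *consecutive_failures = 0;
            Some(valsets)
        }
        Err(e) => {
            *consecutive_failures += 1;
            // error level, not trace: the operator must see this in default logs
            eprintln!(
                "Failed to get unsigned valsets ({} consecutive failures), check your Cosmos gRPC: {}",
                consecutive_failures, e
            );
            None
        }
    }
}
```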

Long term

Some transient network or transaction errors may, and will, happen. The orchestrator is a very sensitive piece of software, as its malfunctioning will lead to validator slashing. Thus, the orchestrator should be enhanced with logic that monitors for errors occurring over prolonged periods of time, and implements various defensive mechanisms depending on the length of the period:

  • for short periods logging at the error level may be enough;
  • for longer periods (between 10 minutes and several hours; the threshold should be configurable), more actions should be taken:
    • try to reconnect, or connect to different nodes;
    • notify the validator by configurable means (a special error message, email, etc.);
    • if only one of the communication links is broken (GRPC / Cosmos RPC), use the other one to communicate about the downtime period.
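The escalation logic above can be sketched as a small monitor. ErrorMonitor, Action, and their fields are illustrative names (nothing like this exists in the codebase yet): it tracks how long a failure class has persisted without an intervening success and escalates the response once a configurable threshold is crossed.

```rust
use std::time::{Duration, Instant};

/// Defensive action to take, escalating with outage duration.
#[derive(Debug, PartialEq)]
enum Action {
    LogError,          // short outage: error-level logging is enough
    ReconnectAndAlert, // prolonged outage: switch nodes, notify the operator
}

/// Hypothetical monitor tracking a continuous run of failures.
struct ErrorMonitor {
    first_error: Option<Instant>,
    escalate_after: Duration, // should be operator-configurable
}

impl ErrorMonitor {
    fn new(escalate_after: Duration) -> Self {
        ErrorMonitor { first_error: None, escalate_after }
    }

    /// Call on every failed loop iteration; returns the action to take.
    fn on_error(&mut self, now: Instant) -> Action {
        // Remember when this run of failures started.
        let start = *self.first_error.get_or_insert(now);
        if now.duration_since(start) >= self.escalate_after {
            Action::ReconnectAndAlert
        } else {
            Action::LogError
        }
    }

    /// Call on every successful iteration to reset the failure window.
    fn on_success(&mut self) {
        self.first_error = None;
    }
}
```

A separate monitor instance per communication link (gRPC vs. Cosmos RPC) would let the orchestrator detect that only one link is down and use the other to report it.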