
GRPC or transaction errors will lead to validator slashing #268

Closed
hannydevelop opened this issue Oct 21, 2021 · 8 comments · Fixed by #275

Comments

@hannydevelop
Contributor

Original Issue

Surfaced from @informalsystems audit of Althea Gravity Bridge at commit 19a4cfe

severity: High
type: Implementation bug
difficulty: Easy

Involved artifacts

Description

Orchestrator's function eth_signer_main_loop(), which is responsible for signing validator set updates, batches, and arbitrary logic calls outgoing from Cosmos to Ethereum, follows the same pattern when processing each of them, demonstrated here on the example of validator set updates:

// sign the last unsigned valsets
match get_oldest_unsigned_valsets(
    &mut grpc_client,
    our_cosmos_address,
    contact.get_prefix(),
)
.await
{
    Ok(valsets) => {
        if valsets.is_empty() {
            trace!("No validator sets to sign, node is caught up!")
        } else {
            info!(
                "Sending {} valset confirms starting with {}",
                valsets.len(),
                valsets[0].nonce
            );
            let res = send_valset_confirms(
                &contact,
                ethereum_key,
                fee.clone(),
                valsets,
                cosmos_key,
                gravity_id.clone(),
            )
            .await;
            trace!("Valset confirm result is {:?}", res);
            check_for_fee_error(res, &fee);
        }
    }
    Err(e) => trace!(
        "Failed to get unsigned valsets, check your Cosmos gRPC {:?}",
        e
    ),
}

There are multiple problems with the above code. First, if get_oldest_unsigned_valsets() returns an error, that error is effectively ignored: it is only logged at the lowest possible level, trace. The error will keep being ignored on every iteration of the loop, without limit.

Second, if get_oldest_unsigned_valsets() doesn't return an error, the code proceeds to call send_valset_confirms() in the following way:

let res = send_valset_confirms(...).await;
trace!("Valset confirm result is {:?}", res);
check_for_fee_error(res, &fee);

The function send_valset_confirms() involves submitting a Cosmos transaction, which may fail. As can be seen, the result of executing this function is assigned to the variable res, which is then checked for errors by check_for_fee_error(), a function with the following structure:

fn check_for_fee_error(res: Result<TxResponse, CosmosGrpcError>, fee: &Coin) {
    if let Err(CosmosGrpcError::InsufficientFees { fee_info }) = res {
        match fee_info {
            FeeInfo::InsufficientFees { min_fees } => {
                error!("Your specified fee value {} is too small please use at least {}", ...);
                error!("Correct fee argument immediately! You will be slashed within a few hours if you fail to do so");
                exit(1);
            }
            FeeInfo::InsufficientGas { .. } => {
                panic!("Hardcoded gas amounts insufficient!");
            }
        }
    }
}

The problem is that check_for_fee_error() indeed checks only for fee errors; any other error that res may contain will be silently ignored, again only logging the result at the lowest trace level.
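
For contrast, here is a minimal sketch of what exhaustive handling of res could look like. This is illustrative only, not the audited code, and the helper name handle_confirm_result is hypothetical:

fn handle_confirm_result(res: Result<TxResponse, CosmosGrpcError>, fee: &Coin) {
    match res {
        Ok(tx) => trace!("Valset confirm result is {:?}", tx),
        Err(CosmosGrpcError::InsufficientFees { fee_info }) => {
            // Same behavior as check_for_fee_error(): an unusable fee is fatal.
            error!("Your specified fee value {:?} is too small: {:?}", fee, fee_info);
            exit(1);
        }
        Err(e) => {
            // Every other gRPC or transaction error is surfaced at the error
            // level instead of being silently dropped at the trace level.
            error!("Failed to submit valset confirms: {:?}", e);
        }
    }
}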

Problem Scenarios

  1. Errors may occur when communicating with Cosmos via gRPC; these errors are ignored and only logged at the lowest trace level.

  2. Errors may occur when submitting a Cosmos transaction; again, these errors are ignored and only logged at the lowest trace level.

The validator will get slashed within a few hours for either of these external errors.

Recommendation

Short term

Rework the above problematic code (a sketch follows the list below):

  • Log errors at the highest possible error level;

  • Do not ignore errors returned as function results: all error cases should be handled.
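
A sketch of the reworked outer match, reusing the function names from the snippet above; this is illustrative only, not the audited code, and the inner result handling defers to the hypothetical handle_confirm_result sketch shown earlier:

match get_oldest_unsigned_valsets(
    &mut grpc_client,
    our_cosmos_address,
    contact.get_prefix(),
)
.await
{
    Ok(valsets) if valsets.is_empty() => {
        trace!("No validator sets to sign, node is caught up!")
    }
    Ok(valsets) => {
        let res = send_valset_confirms(
            &contact,
            ethereum_key,
            fee.clone(),
            valsets,
            cosmos_key,
            gravity_id.clone(),
        )
        .await;
        // handle all error variants, not just fee errors
        handle_confirm_result(res, &fee);
    }
    // log gRPC failures at the error level instead of trace
    Err(e) => error!(
        "Failed to get unsigned valsets, check your Cosmos gRPC {:?}",
        e
    ),
}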

Long term

Some transient network or transaction errors can, and will, happen. The orchestrator is a very sensitive piece of software, as its malfunctioning will lead to validator slashing. Thus, the orchestrator should be enhanced with logic that monitors for errors occurring over prolonged periods of time and implements various defensive mechanisms depending on the length of the period (see the sketch after this list):

  • for short periods, logging at the error level may be enough;

  • for longer periods (between 10 minutes and several hours; the threshold should be configurable), more actions should be taken:

      • try to reconnect, or connect to different nodes;

      • notify the validator by configurable means (a special error message, email, etc.);

      • if only one of the communication links is broken (gRPC / Cosmos RPC), use the other one to communicate about the downtime period.
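
A rough sketch of what such escalation logic could look like; the type, field, and hook names here (ErrorMonitor, escalation_after, and the commented-out reconnect/notify calls) are assumptions for illustration, not part of the code base:

use log::error;
use std::time::{Duration, Instant};

// Tracks how long a communication path has been failing and escalates the
// response once the outage exceeds a configurable threshold.
struct ErrorMonitor {
    failing_since: Option<Instant>,
    escalation_after: Duration, // configurable, e.g. 10 minutes to several hours
}

impl ErrorMonitor {
    fn record_success(&mut self) {
        self.failing_since = None;
    }

    fn record_failure(&mut self, err: &str) {
        let start = *self.failing_since.get_or_insert_with(Instant::now);
        if start.elapsed() < self.escalation_after {
            // short outage: error-level logging is enough
            error!("transient failure: {}", err);
        } else {
            // prolonged outage: reconnect, rotate nodes, alert the operator
            error!(
                "failure persisting for {}s: {}",
                start.elapsed().as_secs(),
                err
            );
            // try_reconnect_or_rotate_node();  // hypothetical hook
            // notify_operator(err);            // hypothetical hook
        }
    }
}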

@cbrit
Member

cbrit commented Oct 26, 2021

Our corresponding Err variant logs at the information level and emits a metric counter for the UNSIGNED_VALSET_FAILURES event:

Err(e) => {
    metrics::UNSIGNED_VALSET_FAILURES.inc();
    info!(
        "Failed to get unsigned valsets, check your Cosmos gRPC {:?}",
        e
    );
}

@cbrit
Member

cbrit commented Oct 26, 2021

The send_valset_confirms lines are replaced by the following in our repo:

let messages = build::signer_set_tx_confirmation_messages(
    &contact,
    ethereum_key,
    valsets,
    cosmos_key,
    gravity_id.clone(),
);
msg_sender
    .send(messages)
    .await
    .expect("Could not send messages");

signer_set_tx_confirmation_messages contains several naked unwrap() calls that will cause panics if the cosmos/ethereum key processing fails somehow. However, I'm not sure the application could get this far in the first place if there were an issue with those keys that would cause a panic in this block.

If msg_sender.send() errors, it will panic and end the Ethereum signing loop because it's handled by an .expect().

@cbrit
Member

cbrit commented Oct 26, 2021

check_for_fee_error does not exist in our code base

@cbrit
Member

cbrit commented Oct 27, 2021

Confirmed with Zaki that the relevant code in our main loops (inside the Ok() blocks in orchestrator/src/main_loop.rs) is safe because the unwrap() calls are on values that would have caused an error earlier in execution if there were an issue.

@cbrit
Member

cbrit commented Oct 27, 2021

I think this is ready for review, @EricBolten @hannydevelop

@EricBolten
Contributor

Our corresponding Err variant logs at the information level and emits a metric counter for the UNSIGNED_VALSET_FAILURES event:

Err(e) => {
    metrics::UNSIGNED_VALSET_FAILURES.inc();
    info!(
        "Failed to get unsigned valsets, check your Cosmos gRPC {:?}",
        e
    );
}

Just an FYI, you can deep link to a code block so it is navigable like so:

Err(e) => {
    metrics::UNSIGNED_VALSET_FAILURES.inc();
    info!(
        "Failed to get unsigned valsets, check your Cosmos gRPC {:?}",
        e
    );
}

Related, though: it seems like the info log level is insufficient for surfacing these error cases. Depending on how common we expect them to be, it should be at least warn, if not error.

@EricBolten
Contributor

EricBolten commented Oct 27, 2021

The send_valset_confirms lines are replaced by the following in our repo:

let messages = build::signer_set_tx_confirmation_messages(
    &contact,
    ethereum_key,
    valsets,
    cosmos_key,
    gravity_id.clone(),
);
msg_sender
    .send(messages)
    .await
    .expect("Could not send messages");

signer_set_tx_confirmation_messages contains several naked unwrap() calls that will cause panics if the cosmos/ethereum key processing fails somehow. However, I'm not sure the application could get this far in the first place if there were an issue with those keys that would cause a panic in this block.

If msg_sender.send() errors, it will panic and end the Ethereum signing loop because it's handled by an .expect().

So, in our case here, our message path looks like this:

msg_sender
    .send(messages)
    .await
    .expect("Could not send messages");

We compose a vector of messages and send them over a tokio channel that is established to connect eth_oracle_main_loop and eth_signer_main_loop to send_main_loop. The audit expressed concern that we were ultimately making a Cosmos gRPC call but only handling one specific type of error rather than the full set of possible errors. The error handling in the previous comment refers to the success or failure of get_oldest_unsigned_valsets, which is a prerequisite before we try to send a message. The expect call here fires if the message cannot be sent over the channel.
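
For readers following along, a rough sketch of that wiring; the channel buffer size is an assumption, and Msg stands in for whatever message type the loops actually exchange:

use tokio::sync::mpsc;

// The signer/oracle loops get clones of the sender; send_main_loop owns the
// receiver and is the only place the Cosmos transactions are actually sent.
let (msg_sender, msg_receiver) = mpsc::channel::<Vec<Msg>>(100);

// eth_signer_main_loop(msg_sender.clone(), ...) and
// eth_oracle_main_loop(msg_sender.clone(), ...) push message batches in;
// send_main_loop(msg_receiver, ...) drains them and calls send_messages().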

send_main_loop on the receiving end of the channel is where our message sending actually happens:

for msg_chunk in messages.chunks(msg_batch_size) {
    match send_messages(
        contact,
        cosmos_key,
        gas_price.to_owned(),
        msg_chunk.to_vec(),
        gas_adjustment,
    )
    .await
    {
        Ok(res) => trace!("okay: {:?}", res),
        Err(err) => error!("fail: {}", err),
    }
}

Here you'll see that we are catching all error types if the gRPC transaction fails. We are also not returning the response object to the caller (it happens asynchronously over the channel), so any action we take to surface this error to the operator has to start here. On the plus side, we are logging at the error level, but the message and the detail provided by interpolating the error struct might be insufficient to help an operator debug the issue they're having. If we're interested in reporting error metrics consistently, we'll probably also want to do so here.
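
If we did add a metric here, one possible shape, mirroring UNSIGNED_VALSET_FAILURES above (the SEND_MESSAGE_FAILURES counter is hypothetical), would be:

Ok(res) => trace!("okay: {:?}", res),
Err(err) => {
    // count the failure and keep a descriptive message for the operator
    metrics::SEND_MESSAGE_FAILURES.inc();
    error!("failed to send message chunk: {}", err);
}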

@EricBolten
Contributor

Addressing the above should be sufficient for the short-term concerns raised here; afterwards, let's ensure we create an issue to track the long-term concerns in future development.
