GRPC or transaction errors will lead to validator slashing #268
Comments
Our corresponding
|
The
If |
|
Confirmed with Zaki that the relevant code in our main loops (inside of the |
I think this is ready for review, @EricBolten @hannydevelop
Just an FYI, you can deep link to a code block so it is navigable like so: gravity-bridge/orchestrator/orchestrator/src/main_loop.rs Lines 309 to 315 in cc9ed23
Related, though, it seems like the |
So, in our case here, our message path looks like this: gravity-bridge/orchestrator/orchestrator/src/main_loop.rs Lines 303 to 306 in 354bcb9
We compose a vector of messages and send them over a tokio channel that is established to connect
gravity-bridge/orchestrator/cosmos_gravity/src/send.rs Lines 212 to 224 in eca0feb
Here you'll see that we are catching all error types if the gRPC transaction fails. We are also not returning the response object to the caller (it's happening async over the channel) so any action we're going to take to surface this error to the operator has to start here. On the plus side, we are logging at the |
Addressing the above should be sufficient for the short-term concerns raised here. Afterwards, let's make sure we create an issue to track the long-term concerns for future development.
Original Issue
Surfaced from @informalsystems audit of Althea Gravity Bridge at commit 19a4cfe
severity: High
type: Implementation bug
difficulty: Easy
Involved artifacts
Description
The orchestrator's function eth_signer_main_loop(), which is responsible for signing validator set updates, batches, and arbitrary logic calls outgoing from Cosmos to Ethereum, follows the same pattern in processing each of them, demonstrated here on the example of validator set updates:
There are multiple problems with the above code.
First, if get_oldest_unsigned_valsets() returns an error, this error is just ignored and logged at the lowest possible trace level. The error will be ignored with every iteration of the loop, without limit.
Second, if get_oldest_unsigned_valsets() doesn't return an error, the code proceeds to call send_valset_confirms() in the following way:
The function send_valset_confirms() involves submitting a Cosmos transaction, which may fail. As can be seen, the result of executing this function is assigned to the variable res, which is then checked for errors by the function check_for_fee_error(), which has the following structure:
The problem is that check_for_fee_error() indeed checks only for fee errors; any other error that res may contain will be silently ignored, again only logging the result at the lowest trace level.
Problem Scenarios
Errors may occur when communicating with Cosmos via GRPC; these errors are ignored, only logging at the lowest trace level.
Errors may occur when submitting a Cosmos transaction; again these errors are ignored, only logging at the lowest trace level.
The validator will get slashed within a few hours for either of the above external errors.
Recommendation
Short term
Rework the above problematic code:
Log errors at the highest possible error level;
Do not ignore errors returned as function results: all error cases should be handled.
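The two short-term points above can be sketched roughly as follows. This is only an illustration, not the orchestrator's actual code: CosmosError, Outcome, and handle_result() are hypothetical names introduced here, and eprintln! stands in for whatever error-level logging macro the orchestrator uses. The idea is that every error variant is matched and logged at the error level, so nothing falls through silently the way it does when only fee errors are checked.

```rust
use std::fmt;

/// Hypothetical stand-in for the errors a Cosmos gRPC call or
/// transaction submission can return (not the real orchestrator type).
#[derive(Debug, Clone, PartialEq)]
pub enum CosmosError {
    InsufficientFees { required: u64 },
    Grpc(String),
    TxFailed { code: u32, log: String },
}

impl fmt::Display for CosmosError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CosmosError::InsufficientFees { required } => {
                write!(f, "insufficient fees, at least {} required", required)
            }
            CosmosError::Grpc(msg) => write!(f, "gRPC error: {}", msg),
            CosmosError::TxFailed { code, log } => {
                write!(f, "transaction failed with code {}: {}", code, log)
            }
        }
    }
}

/// Hypothetical summary of a call result, returned so the main loop
/// can decide what to do next instead of dropping the error.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Ok,
    FeeError(String),
    OtherError(String),
}

/// Handles every error case, not only fee errors, and logs each one
/// at the error level (eprintln! is a stand-in for a logging macro).
pub fn handle_result<T>(context: &str, res: Result<T, CosmosError>) -> Outcome {
    match res {
        Ok(_) => Outcome::Ok,
        Err(err) => {
            let msg = format!("{}: {}", context, err);
            eprintln!("ERROR {}", msg);
            // Exhaustive match: adding a new variant forces a decision here.
            match err {
                CosmosError::InsufficientFees { .. } => Outcome::FeeError(msg),
                CosmosError::Grpc(_) | CosmosError::TxFailed { .. } => {
                    Outcome::OtherError(msg)
                }
            }
        }
    }
}
```

Under these assumptions, a caller would wrap both the query and the submission, for example handle_result("valset confirm", res), and act on the returned Outcome rather than discarding it.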
Long term
Some transient network or transaction errors may, and will, happen. The orchestrator is a very sensitive piece of software, as its malfunctioning will lead to validator slashing. Thus, the orchestrator should be enhanced with logic that monitors for errors occurring over prolonged periods of time and implements various defensive mechanisms depending on the length of the period:
for short periods, logging at the error level may be enough;
for longer periods (between 10 minutes and several hours, should be configurable), more actions should be taken:
try to reconnect, or connect to different nodes;
notify the validator by configurable means (a special error message, email, etc.);
if only one of the communication links is broken (GRPC / Cosmos RPC), use the other one to communicate about the downtime period.
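One possible shape for the monitoring logic described above is sketched below. ErrorMonitor, Escalation, and the escalate_after threshold are hypothetical names and parameters chosen for illustration; the actual reconnection and notification steps are left to the caller. The sketch tracks how long failures have been occurring without an intervening success and reports when the configurable threshold has been crossed.

```rust
use std::time::{Duration, Instant};

/// Hypothetical escalation levels matching the short and long periods above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Escalation {
    /// Failures started recently: logging at the error level may be enough.
    LogError,
    /// Failures have persisted past the threshold: reconnect, notify, etc.
    ReconnectAndNotify,
}

/// Tracks the start of the current run of consecutive failures.
pub struct ErrorMonitor {
    /// Configurable threshold (e.g. 10 minutes) before escalating.
    escalate_after: Duration,
    /// Time of the first failure since the last success, if any.
    failing_since: Option<Instant>,
}

impl ErrorMonitor {
    pub fn new(escalate_after: Duration) -> Self {
        ErrorMonitor { escalate_after, failing_since: None }
    }

    /// A successful call ends the current failure period.
    pub fn record_success(&mut self) {
        self.failing_since = None;
    }

    /// Records a failure observed at `now` and returns the action to take.
    /// The time is passed in so the logic can be tested deterministically.
    pub fn record_failure(&mut self, now: Instant) -> Escalation {
        let since = *self.failing_since.get_or_insert(now);
        if now.saturating_duration_since(since) >= self.escalate_after {
            Escalation::ReconnectAndNotify
        } else {
            Escalation::LogError
        }
    }
}
```

In a main loop, each iteration would call record_success() or record_failure(Instant::now()) and, on ReconnectAndNotify, try another node or alert the operator through whichever channel is still working.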