
fix: Replicas forget messages and halt on restart #228

Merged: 9 commits into main from bf-stage-fix on Jan 20, 2025

Conversation

@brunoffranca (Member) commented on Jan 17, 2025

What ❔

  • Fixes an issue where replicas stop sending timeouts/new views after restarting.
  • Adds more logs for better visibility.

Why ❔

From Slack:

I deployed ChonkyBFT with three validators on stage two weeks ago, and it has halted a few times, always requiring a regenesis (basically a hard fork) to restart it.
Here's the issue. Right now, when we send a message it is added to a queue in our network component, and the component retries sending it until it succeeds. That's basically what we describe in our network model. The problem, though, is that when a node restarts it loses that queue of messages it is trying to send; the only data we persist is the replica's own consensus state, not the outbound queue. What seems to be happening is this:

  1. All replicas time out.
  2. Some replicas crash before they can send their timeout message.
  3. When those replicas restart, they remember having timed out and won't try sending the timeout message again. This can happen if a replica crashes after persisting its state but before the message actually goes out.
  4. If more than f replicas crash in this particular way, consensus halts; the sketch below illustrates the gap.
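
A minimal sketch of that gap, with made-up names (`ReplicaState`, `NetworkComponent`) rather than the actual era-consensus types: the replica state survives a restart because it is persisted, while the network component's retry queue lives only in memory.

```rust
use std::collections::VecDeque;

/// Persisted across restarts (in practice, written to durable storage).
#[derive(Clone, Debug)]
struct ReplicaState {
    view: u64,
    timed_out: bool,
}

/// Lives only in memory: the network component's retry queue.
/// Everything in here is silently lost when the process restarts.
struct NetworkComponent {
    outbox: VecDeque<String>,
}

impl NetworkComponent {
    fn send(&mut self, msg: String) {
        // Enqueued and retried until delivery succeeds...
        self.outbox.push_back(msg);
    }
}

fn main() {
    let state = ReplicaState { view: 5, timed_out: true };
    let mut net = NetworkComponent { outbox: VecDeque::new() };
    net.send(format!("Timeout for view {}", state.view));
    // ...but if we crash here, `state` survives (it was persisted) while
    // `net.outbox` does not: after restart the replica "remembers" having
    // timed out and never re-sends the lost timeout message.
    drop(net);
    println!("restarted with {:?}, outbox empty", state);
}
```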

This is more likely to happen on stage, where we have f=0 and nodes are restarted several times a day. The solution is quite easy: we just keep resending timeouts in the bft component. It doesn't really alter our spec; it's just an implementation detail.
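
A rough sketch of that fix, again with illustrative names rather than the real bft component API: the timeout is re-derived from persisted state and rebroadcast periodically until the view advances, so losing the outbound queue on a restart no longer matters.

```rust
use std::time::{Duration, Instant};

/// Hypothetical timeout message; the real one carries a signed vote.
#[derive(Clone, Debug)]
struct TimeoutMsg {
    view: u64,
}

/// Minimal replica loop state (illustrative, not the real API).
struct Replica {
    view: u64,
    timed_out: bool,
    last_broadcast: Option<Instant>,
    resend_interval: Duration,
}

impl Replica {
    /// Called on every iteration of the replica's main loop. Instead of
    /// sending the timeout exactly once, we re-derive it from persisted
    /// state and rebroadcast it periodically until the view advances, so
    /// a restart cannot permanently silence this replica.
    fn maybe_resend_timeout(&mut self, now: Instant) -> Option<TimeoutMsg> {
        if !self.timed_out {
            return None;
        }
        let due = self
            .last_broadcast
            .map_or(true, |t| now.duration_since(t) >= self.resend_interval);
        if due {
            self.last_broadcast = Some(now);
            Some(TimeoutMsg { view: self.view })
        } else {
            None
        }
    }
}

fn main() {
    let mut r = Replica {
        view: 3,
        timed_out: true,
        last_broadcast: None,
        resend_interval: Duration::from_secs(1),
    };
    // First call sends immediately; later calls wait out the interval.
    println!("{:?}", r.maybe_resend_timeout(Instant::now()));
}
```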
But this got me thinking about the deeper issue here, which is that we can't guarantee eventual delivery of messages. Whether we persist all outgoing messages to a database or are very careful about where in the code we send them, it is always possible for a node to simply forget all the messages it was trying to send. And if we assume that arbitrary messages can be dropped, it's trivial to halt the consensus:

  1. All replicas time out.
  2. A subset of replicas, A, receives all the timeouts and moves to the next view.
  3. Another subset, B, receives no messages from A: no timeouts and no new views.
  4. Now GST passes and no further messages are dropped.
  5. Sets A and B are now in different views. They all keep sending timeouts forever without either set reaching a quorum. Unfortunately, set A does not resend its new view messages, since it already sent them when it changed views; if it did, set B would synchronize with set A.

We should change our assumption from "after GST all messages are delivered, including past ones" to "after GST all messages start being delivered". This strengthens our security model by allowing messages to be dropped until GST. It is a more realistic model for the networks we operate in, and a somewhat stronger form of the partially synchronous model.
The changes are minimal: we keep resending timeouts, as described above, and in addition, whenever we time out we also send a new view message. In effect, we run a loop sending new view and timeout messages until we make progress. It is easy to see that this works: any set of replicas stuck in different views, with no memory of past messages, will keep sending new views until they all synchronize to the same view, and then the timeouts will move everyone to the next view.
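
A sketch of that loop under the same assumptions (hypothetical `Msg` and `retry_broadcast`, not the real era-consensus API): on every retry tick the replica rebroadcasts both its new view message and, if it has timed out, its timeout for the current view.

```rust
use std::time::Duration;

/// Illustrative message kinds; the real ones carry signatures and QCs.
#[derive(Clone, Debug)]
enum Msg {
    Timeout { view: u64 },
    NewView { view: u64 },
}

struct ReplicaState {
    view: u64,
    timed_out: bool,
}

impl ReplicaState {
    /// Messages to rebroadcast on each retry tick. Both kinds are
    /// re-derived from persisted state. Replicas stuck in different views
    /// keep exchanging NewView messages until everyone catches up to the
    /// highest view; then the Timeout messages can reach a quorum and
    /// move the whole set to the next view.
    fn retry_broadcast(&self) -> Vec<Msg> {
        let mut out = vec![Msg::NewView { view: self.view }];
        if self.timed_out {
            out.push(Msg::Timeout { view: self.view });
        }
        out
    }
}

fn main() {
    let state = ReplicaState { view: 7, timed_out: true };
    let retry = Duration::from_millis(500); // the interval is a tuning choice
    // In the real component this runs inside the replica's event loop,
    // repeating every `retry` until the view advances.
    for msg in state.retry_broadcast() {
        println!("rebroadcast (every {:?}): {:?}", retry, msg);
    }
}
```

Because both messages are re-derived from persisted state rather than queued once, a replica that restarts mid-view simply resumes broadcasting on its next tick.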

@brunoffranca marked this pull request as draft on January 17, 2025 18:25
@brunoffranca marked this pull request as ready for review on January 20, 2025 19:23
@brunoffranca changed the title from "Bf stage fix" to "fix: Replicas forget messages and halt on restart" on Jan 20, 2025
@brunoffranca merged commit c723fbe into main on Jan 20, 2025
7 checks passed
@brunoffranca deleted the bf-stage-fix branch on January 20, 2025 20:55
brunoffranca pushed a commit that referenced this pull request Jan 20, 2025
🤖 I have created a release *beep* *boop*
---


## [0.8.0](v0.7.0...v0.8.0) (2025-01-20)


### Features

* documented the dangers of ordering in message encoding ([#225](#225)) ([c1b8e6e](c1b8e6e))


### Bug Fixes

* Replicas forget messages and halt on restart ([#228](#228)) ([c723fbe](c723fbe))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See the [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>