
fix: Replicas forget messages and halt on restart #228

Merged: 9 commits into main from bf-stage-fix on Jan 20, 2025

Conversation

@brunoffranca (Member) commented on Jan 17, 2025

What ❔

  • Fixes an issue where replicas stop sending timeouts/new views after restarting.
  • Adds more logs for better visibility.

Why ❔

From Slack:

I deployed ChonkyBFT with three validators on stage two weeks ago, and it has halted a few times, always requiring a regenesis (basically a hard fork) to restart it.
Here's the issue. Right now, when we send a message it is added to a queue in our network component, and the component retries sending it until it succeeds. That's basically what we describe in our network model. The problem, though, is that when a node restarts it loses that queue of messages it is trying to send; the only data we persist is the replica's own consensus state, not the outbound queue. What seems to be happening is this:

  1. All replicas time out.
  2. Some replicas crash before they can send their timeout message.
  3. When those replicas restart, they remember having timed out and won't try sending the timeout message again. This can happen if a replica crashes after persisting its state but before the message actually goes out.
  4. If more than f replicas crash in this particular way, consensus halts; the sketch below illustrates the gap.
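
A minimal sketch of that gap, with made-up names (`ReplicaState`, `NetworkComponent`) rather than the actual era-consensus types: the replica state survives a restart because it is persisted, while the network component's retry queue lives only in memory.

```rust
use std::collections::VecDeque;

/// Persisted across restarts (in practice, written to durable storage).
#[derive(Clone, Debug)]
struct ReplicaState {
    view: u64,
    timed_out: bool,
}

/// Lives only in memory: the network component's retry queue.
/// Everything in here is silently lost when the process restarts.
struct NetworkComponent {
    outbox: VecDeque<String>,
}

impl NetworkComponent {
    fn send(&mut self, msg: String) {
        // Enqueued and retried until delivery succeeds...
        self.outbox.push_back(msg);
    }
}

fn main() {
    let state = ReplicaState { view: 5, timed_out: true };
    let mut net = NetworkComponent { outbox: VecDeque::new() };
    net.send(format!("Timeout for view {}", state.view));
    // ...but if we crash here, `state` survives (it was persisted) while
    // `net.outbox` does not: after restart the replica "remembers" having
    // timed out and never re-sends the lost timeout message.
    drop(net);
    println!("restarted with {:?}, outbox empty", state);
}
```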

This is more likely to happen on stage, where we have f=0 and nodes are restarted several times a day. The solution is quite easy: we just keep resending timeouts in the bft component. It doesn't really alter our spec; it's just an implementation detail.
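
A rough sketch of that fix, again with illustrative names rather than the real bft component API: the timeout is re-derived from persisted state and rebroadcast periodically until the view advances, so losing the outbound queue on a restart no longer matters.

```rust
use std::time::{Duration, Instant};

/// Hypothetical timeout message; the real one carries a signed vote.
#[derive(Clone, Debug)]
struct TimeoutMsg {
    view: u64,
}

/// Minimal replica loop state (illustrative, not the real API).
struct Replica {
    view: u64,
    timed_out: bool,
    last_broadcast: Option<Instant>,
    resend_interval: Duration,
}

impl Replica {
    /// Called on every iteration of the replica's main loop. Instead of
    /// sending the timeout exactly once, we re-derive it from persisted
    /// state and rebroadcast it periodically until the view advances, so
    /// a restart cannot permanently silence this replica.
    fn maybe_resend_timeout(&mut self, now: Instant) -> Option<TimeoutMsg> {
        if !self.timed_out {
            return None;
        }
        let due = self
            .last_broadcast
            .map_or(true, |t| now.duration_since(t) >= self.resend_interval);
        if due {
            self.last_broadcast = Some(now);
            Some(TimeoutMsg { view: self.view })
        } else {
            None
        }
    }
}

fn main() {
    let mut r = Replica {
        view: 3,
        timed_out: true,
        last_broadcast: None,
        resend_interval: Duration::from_secs(1),
    };
    // First call sends immediately; later calls wait out the interval.
    println!("{:?}", r.maybe_resend_timeout(Instant::now()));
}
```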
But this got me thinking about the deeper issue here, which is that we can't guarantee eventual delivery of messages. Whether we persist all outgoing messages to a database or are very careful about where in the code we send them, it is always possible for a node to simply forget all the messages it was trying to send. And if we assume that arbitrary messages can be dropped, it's trivial to halt the consensus:

  1. All replicas time out.
  2. A subset of replicas, A, receives all the timeouts and moves to the next view.
  3. Another subset, B, receives no messages from A: no timeouts and no new views.
  4. Now GST passes and no further messages are dropped.
  5. Sets A and B are now in different views. They all keep sending timeouts forever without either set reaching a quorum. Unfortunately, set A does not resend its new view messages, since it already sent them when it changed views; if it did, set B would synchronize with set A.

We should change our assumption from "after GST all messages are delivered, including past ones" to "after GST all messages start being delivered". This strengthens our security model by allowing messages to be dropped until GST. It is a more realistic model for the networks we operate in, and a somewhat stronger form of the partially synchronous model.
The changes are minimal: we keep resending timeouts, as described above, and in addition, whenever we time out we also send a new view message. In effect, we run a loop sending new view and timeout messages until we make progress. It is easy to see that this works: any set of replicas stuck in different views, with no memory of past messages, will keep sending new views until they all synchronize to the same view, and then the timeouts will move everyone to the next view.
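
A sketch of that loop under the same assumptions (hypothetical `Msg` and `retry_broadcast`, not the real era-consensus API): on every retry tick the replica rebroadcasts both its new view message and, if it has timed out, its timeout for the current view.

```rust
use std::time::Duration;

/// Illustrative message kinds; the real ones carry signatures and QCs.
#[derive(Clone, Debug)]
enum Msg {
    Timeout { view: u64 },
    NewView { view: u64 },
}

struct ReplicaState {
    view: u64,
    timed_out: bool,
}

impl ReplicaState {
    /// Messages to rebroadcast on each retry tick. Both kinds are
    /// re-derived from persisted state. Replicas stuck in different views
    /// keep exchanging NewView messages until everyone catches up to the
    /// highest view; then the Timeout messages can reach a quorum and
    /// move the whole set to the next view.
    fn retry_broadcast(&self) -> Vec<Msg> {
        let mut out = vec![Msg::NewView { view: self.view }];
        if self.timed_out {
            out.push(Msg::Timeout { view: self.view });
        }
        out
    }
}

fn main() {
    let state = ReplicaState { view: 7, timed_out: true };
    let retry = Duration::from_millis(500); // the interval is a tuning choice
    // In the real component this runs inside the replica's event loop,
    // repeating every `retry` until the view advances.
    for msg in state.retry_broadcast() {
        println!("rebroadcast (every {:?}): {:?}", retry, msg);
    }
}
```

Because both messages are re-derived from persisted state rather than queued once, a replica that restarts mid-view simply resumes broadcasting on its next tick.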

@brunoffranca marked this pull request as draft on January 17, 2025 18:25
@brunoffranca marked this pull request as ready for review on January 20, 2025 19:23
@brunoffranca changed the title from "Bf stage fix" to "fix: Replicas forget messages and halt on restart" on Jan 20, 2025
@brunoffranca merged commit c723fbe into main on Jan 20, 2025
7 checks passed
@brunoffranca deleted the bf-stage-fix branch on January 20, 2025 20:55
brunoffranca pushed a commit that referenced this pull request Jan 20, 2025
🤖 I have created a release *beep* *boop*
---


## [0.8.0](v0.7.0...v0.8.0) (2025-01-20)


### Features

* documented the dangers of ordering in message encoding ([#225](#225)) ([c1b8e6e](c1b8e6e))


### Bug Fixes

* Replicas forget messages and halt on restart ([#228](#228)) ([c723fbe](c723fbe))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See the [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <[email protected]>