Failsafes to prevent a consensus round from taking too long #5277

ximinez · 2025-02-05T04:14:24Z

High Level Overview of Change

This PR, if merged, introduces two failsafes into the consensus logic to prevent a consensus round from remaining open indefinitely.

Currently, if a transaction remains disputed for at least 2x the duration of the previous consensus round, the percentage of UNL validators that must vote "yes" to keep it in the set rises to 95%. This PR adds two further cutoffs:
1. If the transaction remains disputed for 4x the duration of the previous round, the required percentage rises to 100%.
2. If the dispute is still unresolved at 5x, which should be impossible, every node changes its vote to "no".

Additionally, if the round as a whole takes more than 10x the time of the previous round (bounded just in case), the round is considered "expired": the node leaves the round, sends a "partial validation" (indicating that it is moving on without validating), and starts the next round. When enough nodes leave the round, any remaining nodes will see that they have fallen behind and move on too, generally before hitting the timeout. Any validations or partial validations sent during this time help the consensus process bring the nodes back together.
- The 10x time is bounded below by ledgerMAX_CONSENSUS (15 seconds) and above by ledgerABANDON_CONSENSUS (60 seconds). The lower bound prevents an unusually fast consensus round from being punished into aborting unusually early on the next round; the upper bound prevents the potential round time from growing without bound. For example, if one round takes 60 seconds, we don't want to let the next round run for 10 minutes.
- There was discussion of adding a random factor into whether the node decides to leave the round. I decided against that for now because there's already a lot of variation in consensus round times, and magnifying that by 10 seemed good enough. Let me know if you disagree.
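
The cutoffs and the bounded expiry described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `DisputeAction`, `disputeAction`, and `roundExpiry` are hypothetical names, the 15 and 60 second values are the ones quoted above, and the exact boundary comparisons (`>=`) are a guess.

```cpp
#include <algorithm>
#include <chrono>

using namespace std::chrono_literals;
using Ms = std::chrono::milliseconds;

// Hypothetical summary of how a node treats one disputed transaction.
enum class DisputeAction { NormalThreshold, Require95, Require100, ForceNo };

// Map the time a transaction has been disputed, relative to the duration
// of the previous round, onto the escalation steps listed above.
constexpr DisputeAction
disputeAction(Ms disputed, Ms previousRound)
{
    if (disputed >= 5 * previousRound)
        return DisputeAction::ForceNo;      // every node votes "no"
    if (disputed >= 4 * previousRound)
        return DisputeAction::Require100;   // 100% "yes" needed to keep it
    if (disputed >= 2 * previousRound)
        return DisputeAction::Require95;    // existing 95% cutoff
    return DisputeAction::NormalThreshold;  // earlier, lower thresholds
}

// Bounds quoted in the PR description (assumed values for this sketch).
constexpr Ms ledgerMAX_CONSENSUS = 15s;
constexpr Ms ledgerABANDON_CONSENSUS = 60s;

// Time after which the whole round is considered "expired":
// 10x the previous round, clamped to [15s, 60s].
constexpr Ms
roundExpiry(Ms previousRound)
{
    return std::clamp(
        10 * previousRound, ledgerMAX_CONSENSUS, ledgerABANDON_CONSENSUS);
}
```

Under these assumptions, a 3 second previous round gives a 30 second expiry, a 1 second round is raised to the 15 second floor, and a 60 second round is capped at 60 seconds rather than 10 minutes.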

Context of Change

At about 9:54pm UTC on 2/4/2025, the network successfully validated ledger 93927173, and started the consensus round for 93927174. That round did not end for over an hour.

The current evidence indicates that two things happened:

1. Some disputed transactions had just enough "yes" votes that validators voting "yes" saw the approval as just over 95%, while those voting "no" saw it as just under 95%. Thus every node thought that it was doing the right thing, and no node changed its vote. While this is annoying, consensus normally moves on anyway, because at least 80% of the UNL validators agree on which transaction set to use. However,
2. The disputed transactions with the close approval rates were distributed such that several clumps of validators each voted "yes" on different transactions. As a result, no transaction set had 80% approval.

This led to a deadlock-like situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.
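
To make the second point concrete, here is a small sketch that measures how much support the most popular transaction set has. It is an illustration only: `TxSet`, `largestAgreementPct`, and `examplePositions` are hypothetical names, and the 10-validator split is invented for the example, not taken from the incident.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// A validator's position: the disputed transactions it wants included.
using TxSet = std::set<std::string>;

// Percentage of validators backing the single most popular transaction set.
inline std::size_t
largestAgreementPct(std::vector<TxSet> const& positions)
{
    assert(!positions.empty());
    std::map<TxSet, std::size_t> counts;
    std::size_t best = 0;
    for (auto const& position : positions)
        best = std::max(best, ++counts[position]);
    return best * 100 / positions.size();
}

// Invented example: three clumps of validators, each holding a different set.
inline std::vector<TxSet>
examplePositions()
{
    std::vector<TxSet> positions;
    positions.insert(positions.end(), 4, TxSet{"txA"});
    positions.insert(positions.end(), 3, TxSet{"txB"});
    positions.insert(positions.end(), 3, TxSet{"txA", "txB"});
    return positions;
}
```

With this split the best-supported set has 40% of the validators, well short of 80%, so no set can win unless some validators change their votes.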

This decision algorithm has been in place for at least 8 years, and possibly since the first release of rippled. The odds of it happening were thought to be 0, but it turns out they're just very very small.

Type of Change

Bug fix (non-breaking change which fixes an issue)

This change is fully backward and forward compatible, and does not require an amendment.

codecov · 2025-02-05T04:39:22Z

Codecov Report

Attention: Patch coverage is 89.47368% with 8 lines in your changes missing coverage. Please review.

Project coverage is 78.2%. Comparing base (a079bac) to head (5108e55).

Files with missing lines	Patch %	Lines
src/xrpld/app/misc/NetworkOPs.cpp	0.0%	6 Missing ⚠️
src/xrpld/consensus/Consensus.h	95.7%	1 Missing ⚠️
src/xrpld/consensus/DisputedTx.h	95.7%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #5277   +/-   ##
=======================================
  Coverage     78.2%   78.2%           
=======================================
  Files          790     790           
  Lines        67639   67689   +50     
  Branches      8160    8154    -6     
=======================================
+ Hits         52869   52927   +58     
+ Misses       14770   14762    -8

Files with missing lines	Coverage Δ
src/xrpld/app/consensus/RCLValidations.cpp	`74.5% <100.0%> (+0.3%)`	⬆️
src/xrpld/consensus/Consensus.cpp	`98.5% <100.0%> (+0.2%)`	⬆️
src/xrpld/consensus/ConsensusParms.h	`100.0% <100.0%> (ø)`
src/xrpld/consensus/ConsensusTypes.h	`74.4% <ø> (ø)`
src/xrpld/consensus/Consensus.h	`91.4% <95.7%> (+2.5%)`	⬆️
src/xrpld/consensus/DisputedTx.h	`96.6% <95.7%> (+0.6%)`	⬆️
src/xrpld/app/misc/NetworkOPs.cpp	`70.0% <0.0%> (-0.2%)`	⬇️

... and 4 files with indirect coverage changes

Bronek · 2025-02-05T19:51:22Z

src/xrpld/consensus/DisputedTx.h

            newPosition = weight > p.avSTUCK_CONSENSUS_PCT;
+        else
+            newPosition = false;


This is so simple that it's obviously correct.

ximinez · 2025-02-05

This has been rewritten a bit.

Bronek · 2025-02-06

Still, the ending newPosition = false; remained, and I like that.

Bronek · 2025-02-05T19:52:56Z

src/xrpld/consensus/Consensus.cpp

@@ -181,6 +181,12 @@ checkConsensus(
        return ConsensusState::MovedOn;
    }

+    if (currentAgreeTime > parms.ledgerMAX_CONSENSUS + previousAgreeTime)


I think that because of this condition here, we are unable to test the change in DisputedTx.h. Can you engineer timings such that we will test the last newPosition = false in DisputedTx::updateVote as well?

ximinez · 2025-02-05

This has been revised, too.

Bronek

Would be cool to have a unit test for the last part of DisputedTx.h; not sure how realistic that request is. Approved in any case.

- Stable state means that neither we, nor any of our peers has changed a vote on a disputed transaction in a while. This is undesirable if an 80% consensus has not otherwise been reached. It will cause a validation to be sent, which will help get other (trusting) validators back on track using preferred ledger logic.

vlntb

The current version fails to build on macOS because the standard library used there (libc++) does not provide the assignment operator needed for the std::map's pairs. There are two viable fixes:

1. Add an assignment operator to ConsensusParms. We could add a custom assignment operator for ConsensusParms that, instead of assigning the avalancheCutoffs map directly (which triggers the error), manually copies its contents into the target map.
2. Make avalancheCutoffs const and remove the assignment. Declaring the map as const avoids any assignment after its construction. I would prefer this option, but it means that we must update the unit tests that do

peer->consensusParms = parms;

so that they no longer try to perform such an assignment.
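
The two options can be compared in a small, self-contained sketch. The struct and enumerator names (`ParmsCustomAssign`, `ParmsConstMap`, `early`, `stuck`) and the table values are assumptions made for illustration; only `AvalancheState`, `AvalancheCutoff`, `consensusPct`, `next`, and `avalancheCutoffs` appear in the PR.

```cpp
#include <cstddef>
#include <map>
#include <type_traits>

enum class AvalancheState { early, stuck };  // hypothetical enumerators

// The const members make this type copy-constructible but not assignable.
struct AvalancheCutoff
{
    std::size_t const consensusPct;
    AvalancheState const next;
};

// Option 1: keep the map mutable and copy element by element.
struct ParmsCustomAssign
{
    std::map<AvalancheState, AvalancheCutoff> avalancheCutoffs{
        {AvalancheState::early, {50, AvalancheState::stuck}},
        {AvalancheState::stuck, {95, AvalancheState::stuck}}};

    ParmsCustomAssign() = default;
    ParmsCustomAssign(ParmsCustomAssign const&) = default;

    ParmsCustomAssign&
    operator=(ParmsCustomAssign const& other)
    {
        if (this != &other)
        {
            // Rebuild the map by copy-constructing each element, so no
            // AvalancheCutoff ever needs to be assigned.
            avalancheCutoffs.clear();
            for (auto const& [state, cutoff] : other.avalancheCutoffs)
                avalancheCutoffs.emplace(state, cutoff);
        }
        return *this;
    }
};

// Option 2: make the map const; the struct can then be copied at
// construction time but never assigned afterwards.
struct ParmsConstMap
{
    std::map<AvalancheState, AvalancheCutoff> const avalancheCutoffs{
        {AvalancheState::early, {50, AvalancheState::stuck}},
        {AvalancheState::stuck, {95, AvalancheState::stuck}}};
};

static_assert(std::is_copy_assignable_v<ParmsCustomAssign>);
static_assert(std::is_copy_constructible_v<ParmsConstMap>);
static_assert(!std::is_copy_assignable_v<ParmsConstMap>);
```

The last assertion is the trade-off mentioned above: with a const map, any statement of the form `peer->consensusParms = parms;` stops compiling, so such assignments have to be removed from the tests.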

vlntb · 2025-02-12T14:33:38Z

src/xrpld/consensus/ConsensusParms.h

+        std::size_t const consensusPct;
+        AvalancheState const next;
+    };
+    std::map<AvalancheState, AvalancheCutoff> avalancheCutoffs = {


To work around the std::map copy-assignment problem with the standard library on macOS:

Suggested change

-    std::map<AvalancheState, AvalancheCutoff> avalancheCutoffs = {
+    const std::map<AvalancheState, AvalancheCutoff> avalancheCutoffs{

vlntb · 2025-02-12T14:38:56Z

src/test/consensus/Consensus_test.cpp

@@ -589,6 +652,7 @@ class Consensus_test : public beast::unit_test::suite
    {
        using namespace csf;
        using namespace std::chrono;
+        testcase("consensus close time rounding");


The reassignment further down in this test,

for (Peer* peer : network) peer->consensusParms = parms;

is redundant and should be removed, because parms still holds its default values.
It also conflicts with the other suggested change, which makes avalancheCutoffs const.

ximinez changed the title ~~Drop out of consensus if the round takes too long~~ Failsafes to prevent a consensus round from taking too long Feb 5, 2025

ximinez requested review from Bronek, JoelKatz and vlntb February 5, 2025 19:01

Bronek reviewed Feb 5, 2025

ximinez force-pushed the ximinez/consensus branch from f07992d to 76c27a0 Compare February 5, 2025 22:05

ximinez marked this pull request as ready for review February 5, 2025 23:16

ximinez requested a review from Bronek February 6, 2025 00:02

Bronek approved these changes Feb 6, 2025

ximinez force-pushed the ximinez/consensus branch 4 times, most recently from 6e513d9 to 26ab221 Compare February 11, 2025 01:15

bthomee added this to the 2.4.0 (Q1 2025) milestone Feb 11, 2025

ximinez force-pushed the ximinez/consensus branch 2 times, most recently from 8dcbf91 to a6d3cea Compare February 12, 2025 04:12

ximinez added 2 commits February 11, 2025 23:13

Drop out of consensus if the round takes too long

60de826

ximinez force-pushed the ximinez/consensus branch from a6d3cea to 197356b Compare February 12, 2025 04:13

ximinez added 2 commits February 12, 2025 00:18

[WIP] Consensus tests

8ad2cb3

[WIP] Fix builds

7e2fa5c

vlntb requested changes Feb 12, 2025

vlntb reviewed Feb 12, 2025

ximinez added 2 commits February 12, 2025 11:35

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

196f6b6

* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268)

Update levelization

5108e55
