-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failsafes to prevent a consensus round from taking too long #5277
base: develop
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5277 +/- ##
=======================================
Coverage 78.2% 78.2%
=======================================
Files 790 790
Lines 67639 67689 +50
Branches 8160 8154 -6
=======================================
+ Hits 52869 52927 +58
+ Misses 14770 14762 -8
|
src/xrpld/consensus/DisputedTx.h
Outdated
newPosition = weight > p.avSTUCK_CONSENSUS_PCT; | ||
else | ||
newPosition = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so simple that it's obviously correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been rewritten a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, the ending newPosition = false;
remained, and I like that.
src/xrpld/consensus/Consensus.cpp
Outdated
@@ -181,6 +181,12 @@ checkConsensus( | |||
return ConsensusState::MovedOn; | |||
} | |||
|
|||
if (currentAgreeTime > parms.ledgerMAX_CONSENSUS + previousAgreeTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that because of this condition here, we are unable to test the change in DisputedTx.h
- can you engineer timings such that we will test the last newPosition = false
in DisputedTx::updateVote
as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that because of this condition here, we are unable to test the change in
DisputedTx.h
- can you engineer timings such that we will test the lastnewPosition = false
inDisputedTx::updateVote
as well ?
This has been revised, too.
f07992d
to
76c27a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be cool to have a unit test for the last part of DisputedTx.h
; not sure how realistic that request is. Approved in any case.
6e513d9
to
26ab221
Compare
8dcbf91
to
a6d3cea
Compare
- Stable state means that neither we, nor any of our peers has changed a vote on a disputed transaction in a while. This is undesirable if an 80% consensus has not otherwise been reached. It will cause a validation to be sent, which will help get other (trusting) validators back on track using preferred ledger logic.
a6d3cea
to
197356b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current version fails to build on MacOS because Mac's version of libstdc++ is dropping the assignment operator for std::map
pairs. There are two viable fixes:
-
Add an assignment operator to
ConsensusParms
:
We could add a custom assignment operator forConsensusParms
that, instead of assigning theavalancheCutoffs
map directly (which triggers the error), manually copies its contents into the target map. -
Make
avalancheCutoffs
const and remove the assignment:
By declaring the map as const, you avoid any assignment after its construction. I would prefer this option, but it means that we must update the unit tests that does
peer->consensusParms = parms;
so that it no longer tries to perform such an assignment.
std::size_t const consensusPct; | ||
AvalancheState const next; | ||
}; | ||
std::map<AvalancheState, AvalancheCutoff> avalancheCutoffs = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To work around std::map copying in the libstdc++ on MacOS:
std::map<AvalancheState, AvalancheCutoff> avalancheCutoffs = { | |
const std::map<AvalancheState, AvalancheCutoff> avalancheCutoffs{ |
@@ -589,6 +652,7 @@ class Consensus_test : public beast::unit_test::suite | |||
{ | |||
using namespace csf; | |||
using namespace std::chrono; | |||
testcase("consensus close time rounding"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The re-assignment below in the test
for (Peer* peer : network)
peer->consensusParms = parms;
does not make sense and should be removed, since parms is not modified from its default values.
It, however, conflicts with the other suggested change to make avalancheCutoffs
const
* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268)
High Level Overview of Change
This PR, if merged, introduces two fail safes into the consensus logic to prevent a consensus round from remaining open indefinitely.
ledgerMAX_CONSENSUS
(15 seconds) andledgerABANDON_CONSENSUS
(60 seconds). This prevents an unusually fast consensus round from being punished into aborting unusually early on the next round, and prevents the potential round time from growing without bound. i.e. If one round takes 60 seconds, we don't want to let the next round run for 10 minutes.Context of Change
At about 9:54pm UTC on 2/4/2025, the network successfully validated ledger 93927173, and started the consensus round for 93927174. That round did not end for over an hour.
The current evidence indicates that two things happened.
This led to a deadlock-like situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.
This decision algorithm has been in place for at least 8 years, and possibly since the first release of
rippled
. The odds of it happening were thought to be 0, but it turns out they're just very very small.Type of Change
This change is fully backward and forward compatible, and does not require an amendment.