Accumulate candidates and votes from previous iterations, same round #1038
Conversation
- Try to accept the received block if the node has fallen back to the previous block only
- Filter out blacklisted blocks in both modes (InSync and OutOfSync)
- Blacklist the hash from the fork
consensus/src/execution_ctx.rs
Outdated
```rust
        }

        if let Err(e) = self.outbound.send(msg.clone()).await {
            error!("could not send newblock msg due to {:?}", e);
```
Should this adapt to the msg type instead of `newblock`?
Indeed.
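For illustration, a minimal sketch of what the suggested adjustment could look like; `msg.topic()` is an assumed accessor for the message type and may not match the actual API:

```rust
// Sketch only: log the actual message type instead of hardcoding "newblock".
// `msg.topic()` is an assumed accessor, not necessarily the real API.
if let Err(e) = self.outbound.send(msg.clone()).await {
    error!("could not send {:?} msg due to {:?}", msg.topic(), e);
}
```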
```diff
@@ -18,7 +18,9 @@ use tracing::{debug, error, warn};
 /// voters.StepVotes Mapping of a block hash to both an aggregated signatures
 /// and a cluster of bls voters.
 #[derive(Default)]
-pub struct Aggregator(BTreeMap<Hash, (AggrSignature, Cluster<PublicKey>)>);
+pub struct Aggregator(
+    BTreeMap<(u8, Hash), (AggrSignature, Cluster<PublicKey>)>,
```
I'm not sure we need the "step" part in the key.
There's no collision between the block hashes...
Does it have another goal?
If this is not introduced, we accumulate votes for the empty hash during all iterations. Before introducing this fix, I noticed that votes for the empty hash were more than 67%.
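As an illustration of the fix described above, here is a small, self-contained sketch (simplified types and plain vote counters instead of the real `AggrSignature`/`Cluster` values) showing how keying the aggregator by `(step, hash)` keeps empty-hash (NIL) votes from different iterations separate:

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the real block hash type.
type Hash = [u8; 32];

fn main() {
    let empty: Hash = [0u8; 32];

    // Keyed by hash only: NIL votes cast at different steps pile up in a
    // single bucket and can cross the quorum threshold together.
    let mut by_hash: BTreeMap<Hash, u32> = BTreeMap::new();
    *by_hash.entry(empty).or_default() += 1; // NIL vote at step 1
    *by_hash.entry(empty).or_default() += 1; // NIL vote at step 4
    assert_eq!(by_hash[&empty], 2);

    // Keyed by (step, hash): each iteration accumulates its own NIL votes.
    let mut by_step_hash: BTreeMap<(u8, Hash), u32> = BTreeMap::new();
    *by_step_hash.entry((1, empty)).or_default() += 1; // NIL vote at step 1
    *by_step_hash.entry((4, empty)).or_default() += 1; // NIL vote at step 4
    assert_eq!(by_step_hash[&(1, empty)], 1);
    assert_eq!(by_step_hash[&(4, empty)], 1);
}
```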
I see.
This is due to the ambiguity of votes. The vote should be separate from the block hash, so as to distinguish the case of an invalid candidate from that of a candidate not being received. Still, in the case of an empty candidate, the iteration/step is indeed needed.
That said, I think nodes should not vote on the empty hash. Instead, they should vote NIL only on invalid candidates.
That is:
- if no candidate is received, do not vote
- if an invalid candidate is received, vote NIL on the candidate
Note that voting NIL for not having received a block also conflicts with waiting (and voting) for previous-iteration blocks, as it could lead to the same provisioner voting NIL (because it didn't receive the block) and then voting for the block's hash when receiving it at a later iteration.
> That said, I think nodes should not vote on the empty hash.
What metrics do you expect to improve by implementing this patch (not voting on the empty hash)? Why do you propose it here as a comment instead of posting a DIP with a better and more complete explanation?
I wrote it here as part of the discussion. DIP/Issues can always stem from comments :)
I will write a proposal for this, but, in short, I think voting on the empty hash yields the same issues as not collecting votes from previous iterations.
Either way, if we vote NIL when not receiving a block, we shouldn't vote on previous-iteration blocks, or we could produce a double (and opposite) vote on the same block.
Created an Issue here
As mentioned in the DIP Issue, relative to this PR, if the node votes NIL for a round/iteration after the timeout and then votes on the candidate when receiving it (`vote_for_former_candidate`), it would produce two votes for the same block, creating potential indeterminism (nodes can choose either of the two votes).
Simply put:
- a NIL vote means a vote for the tuple (empty_hash, round, step)
- a real vote means a vote for the tuple (candidate_block_hash, round, step)
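A minimal sketch (illustrative types only, not the actual consensus structures) of the distinction between these two tuples:

```rust
// Illustrative only: a NIL vote and a real vote for the same round/step are
// different tuples because their hashes differ.
type Hash = [u8; 32];

#[derive(Debug, PartialEq, Eq)]
struct Vote {
    hash: Hash, // all zeros for a NIL vote, candidate hash otherwise
    round: u64,
    step: u8,
}

fn main() {
    let nil = Vote { hash: [0u8; 32], round: 10, step: 1 };
    let real = Vote { hash: [0xAB; 32], round: 10, step: 1 };
    // Same round and step, but the votes are tallied under different keys.
    assert_ne!(nil, real);
}
```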
> nodes can choose either of the two votes
Not the case at all. Nodes can reach quorum for an empty hash at iteration 1 and then move forward and reach quorum for the candidate block of iteration 1 while running the second iteration. The output in this case is a valid finalized block.
> creating potential indeterminism (nodes can choose either of the two votes).
There is no indeterminism, as these are not the same blocks. Also, voting NIL only helps provisioners move forward when a candidate block is not generated; it has no other impact.
We do not currently distinguish between NIL for unknown candidate and NIL for invalid block.
In both cases, the vote will be (round, step, empty_hash).
So, a NIL vote is effectively valid as a negative vote on the candidate, whichever that might be.
If we produce both a NIL and a candidate vote, they would be contradictory votes for the same block.
And if I receive both votes, I can use either for the quorum, and both would be valid votes.
In fact, in extreme cases, there could be a quorum on both NIL and the candidate.
Example 1:
- I'm at round/step
- I don't receive the block, so I vote NIL (empty,round,step)
- I then receive the block and vote (candidate,round,step)
Now there are two votes for the same round-step.
Let's say the committee is split in half (half voted NIL, half voted the candidate) and my vote is the one deciding the quorum.
Nodes receiving the NIL vote first will reach quorum on NIL, and move to the next iteration;
Nodes receiving the candidate vote first will reach quorum on the block, and move to the next round.
Both nodes will have a valid quorum for the same round and step, but one is NIL and one is for the candidate.
Example 2:
- let's say all nodes are at round/step
- the candidate is delayed, so all provisioners vote NIL
- then the block is broadcasted and they all vote for it
Depending on which votes they receive first, nodes can either reach quorum on the candidate or on the empty hash.
> Not the case at all. Nodes can reach quorum for an empty hash at iteration 1 and then move forward and reach quorum for the candidate block of iteration 1 while running the second iteration. The output in this case is a valid finalized block.
You're only seeing the best-case scenario.
In this case, you still have two valid quorums for the same round/iteration.
If a quorum is reached on iteration 2, it's possible that some nodes will move to the next round based on that quorum while others will accept the first-iteration block and move to the next round.
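To make Example 1 concrete, here is a toy tally with made-up numbers (the committee size and threshold are not the real consensus parameters):

```rust
// Toy numbers only: 9 committee members, quorum of 5. Four vote NIL, four
// vote the candidate, and one member casts both votes (NIL first, then the
// candidate once it finally receives the block).
fn main() {
    let quorum = 5;
    let nil_votes = 4;
    let candidate_votes = 4;

    // A peer that happens to count the double-voter's NIL vote:
    let quorum_on_nil = nil_votes + 1 >= quorum;
    // A peer that happens to count the double-voter's candidate vote:
    let quorum_on_candidate = candidate_votes + 1 >= quorum;

    // Both quorums are reachable for the same round/step: one on NIL
    // (next iteration) and one on the block (next round).
    assert!(quorum_on_nil && quorum_on_candidate);
}
```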
Reaching quorum on a tuple (empty_hash, round, step) does not impact reaching quorum for any candidate block of the same round.
I'd suggest pairing up (when you're available) to review the implementation together and make sure we're on the same page.
node/src/chain/fsm.rs
Outdated
```rust
            // the network for that broadcast may damage the
            // fallback for nodes on different fork.

            self.network.write().await.broadcast(msg).await;
```
I don't think this is needed.
If consensus accepted the block, it means we already broadcasted it (or at least, we should have).
Also, if we already accepted the block from the network, we should not re-broadcast it, to avoid possible DoS attacks (e.g. by flooding the network with duplicate blocks).
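For reference, a sketch of the kind of guard this comment argues for; `is_known_block` is a hypothetical helper used only to illustrate the idea, not an existing API:

```rust
// Sketch only: skip re-broadcasting blocks we have already accepted/relayed.
// `self.is_known_block(...)` is a hypothetical helper for illustration.
if !self.is_known_block(&msg) {
    self.network.write().await.broadcast(msg).await;
}
```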
I'm aware of that. This would also be the case when we implement fallback to any former (non-finalized) block, as we'd need to re-broadcast blocks with a lower tip.
The reasoning behind this change is: any node that has accepted a block at the 1st iteration will not re-propagate it, eclipsing the network for other (non-consensus) nodes that are still on the wrong fork.
I don't follow the reasoning...
If the block is known, be it because it was received from peers or produced by consensus, it would have already been propagated to the node's peers.
Re-broadcasting the same block when received from the network would be a duplicate.
If nodes accepted that block at the 1st iteration, they will have already broadcast the same message they are receiving now.
Noted. I'll remove it once I confirm that, in the case of a fallback, all nodes receive the correct block.
Closed in favor of PR #1098
To accomplish the experiment's goal, this PR implements the following ideas: