
Accumulate candidates and votes from previous iterations, same round #1038

Closed
wants to merge 35 commits

Conversation


@goshawk-3 goshawk-3 commented Sep 5, 2023

To accomplish the experiment's goal, this PR implements the following ideas (a simplified sketch of the per-iteration agreement tracking follows the list):

  • consensus: Always keep the aggregator state of both reduction steps within the context of a single round.
  • consensus: When a message from a former step arrives, collect it instead of discarding it.
  • node-data: Use different topic IDs for the first and second reduction messages.
  • consensus: Generate an agreement for any iteration once step_votes for both reductions are produced.
  • consensus: Vote for a candidate block from a former iteration of the same round.
  • node: Broadcast any block equal to the local tip.
  • fallback: Try to accept the received block if a node falls back to the previous block only.
  • fallback: Filter out blacklisted blocks in both modes (inSync and OutOfSync).
  • fallback: Blacklist the hash from the fork.
  • node: Initiate a recovery procedure if a node is on a fork.
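
A minimal sketch of the agreement idea above, using simplified stand-in types (RoundState, IterationVotes, and on_step_votes are illustrative names, not the node's actual API):

```rust
use std::collections::HashMap;

// Illustrative stand-in: the real node keeps BLS-aggregated StepVotes per
// reduction step; here a bool simply marks "StepVotes produced".
#[derive(Default, Clone, Copy)]
struct IterationVotes {
    first_reduction: bool,
    second_reduction: bool,
}

/// Tracks, per iteration of the current round, whether StepVotes exist for
/// both reduction steps, so an Agreement can be generated for any iteration,
/// even if the votes arrive while the node is already on a later one.
#[derive(Default)]
struct RoundState {
    iterations: HashMap<u8, IterationVotes>,
}

impl RoundState {
    /// Records StepVotes for one reduction step of `iteration` and returns
    /// true once both steps are covered (i.e. an Agreement can be produced).
    fn on_step_votes(&mut self, iteration: u8, second_reduction: bool) -> bool {
        let entry = self.iterations.entry(iteration).or_default();
        if second_reduction {
            entry.second_reduction = true;
        } else {
            entry.first_reduction = true;
        }
        entry.first_reduction && entry.second_reduction
    }
}

fn main() {
    let mut round = RoundState::default();
    // First reduction of iteration 1 completes...
    assert!(!round.on_step_votes(1, false));
    // ...the node moves on, then late second-reduction StepVotes for
    // iteration 1 arrive and still complete the agreement for it.
    assert!(round.on_step_votes(1, true));
}
```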

@goshawk-3 goshawk-3 changed the title Accumulate messages from previous iterations Accumulate candidates and votes from previous iterations, same round Sep 5, 2023
@goshawk-3 goshawk-3 requested a review from fed-franz September 20, 2023 08:50
@goshawk-3 goshawk-3 marked this pull request as ready for review September 20, 2023 09:04
}

if let Err(e) = self.outbound.send(msg.clone()).await {
error!("could not send newblock msg due to {:?}", e);
Contributor

should this adapt to the msg type instead of newblock?

Contributor Author

Indeed.

@@ -18,7 +18,9 @@ use tracing::{debug, error, warn};
 /// voters.StepVotes Mapping of a block hash to both an aggregated signatures
 /// and a cluster of bls voters.
 #[derive(Default)]
-pub struct Aggregator(BTreeMap<Hash, (AggrSignature, Cluster<PublicKey>)>);
+pub struct Aggregator(
+    BTreeMap<(u8, Hash), (AggrSignature, Cluster<PublicKey>)>,
Contributor

I'm not sure we need the "step" part in the key.
There's no collision between the block hashes...

Does it have another goal?

Contributor Author

If this is not introduced, we accumulate votes for the empty hash across all iterations. Before introducing this fix, I noticed that votes for an empty hash exceeded 67%.
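
A toy illustration of the collision being described, using plain counters instead of the real aggregated signatures (all names here are illustrative):

```rust
use std::collections::BTreeMap;

type Hash = [u8; 32];
const EMPTY_HASH: Hash = [0u8; 32];

fn main() {
    // Keyed by hash only: NIL votes from every iteration of the round land in
    // the same empty-hash bucket, inflating its share of the vote.
    let mut by_hash: BTreeMap<Hash, usize> = BTreeMap::new();
    for _step in [1u8, 3, 5] {
        *by_hash.entry(EMPTY_HASH).or_insert(0) += 1;
    }
    assert_eq!(by_hash[&EMPTY_HASH], 3);

    // Keyed by (step, hash), as in the new Aggregator signature: each
    // iteration's NIL votes are counted separately.
    let mut by_step_hash: BTreeMap<(u8, Hash), usize> = BTreeMap::new();
    for step in [1u8, 3, 5] {
        *by_step_hash.entry((step, EMPTY_HASH)).or_insert(0) += 1;
    }
    assert_eq!(by_step_hash[&(1u8, EMPTY_HASH)], 1);
}
```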

Contributor

I see.
This is due to the ambiguity of votes. The vote should be separate from the block hash, so as to distinguish the case of an invalid candidate from that of the candidate not having been received. Still, in the case of an empty candidate, the iteration/step is indeed needed.

That said, I think nodes should not vote on the empty hash. Instead, they should vote NIL only on invalid candidates.
That is (sketched below):

  • if no candidate is received, do not vote
  • if an invalid candidate is received, vote NIL on the candidate

Note that voting NIL for not having received a block is also at odds with waiting for (and voting on) previous-iteration blocks, as it could lead to the same provisioner voting NIL (because it didn't receive the block) and then voting for the block's hash upon receiving it while on a later iteration.
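
A minimal sketch of the rule being proposed (the names and structure are illustrative, not an actual implementation):

```rust
type Hash = [u8; 32];

/// The proposed behaviour: abstain when no candidate was received, vote NIL
/// only on an invalid candidate, otherwise vote for the candidate's hash.
#[derive(Debug)]
enum VoteDecision {
    Abstain,
    Nil,
    For(Hash),
}

fn decide(candidate: Option<Hash>, is_valid: impl Fn(&Hash) -> bool) -> VoteDecision {
    match candidate {
        // No candidate received: do not vote at all.
        None => VoteDecision::Abstain,
        // Invalid candidate received: vote NIL on it.
        Some(h) if !is_valid(&h) => VoteDecision::Nil,
        // Valid candidate: vote its hash.
        Some(h) => VoteDecision::For(h),
    }
}

fn main() {
    assert!(matches!(decide(None, |_| true), VoteDecision::Abstain));
    assert!(matches!(decide(Some([1u8; 32]), |_| false), VoteDecision::Nil));
    assert!(matches!(decide(Some([1u8; 32]), |_| true), VoteDecision::For(_)));
}
```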

Contributor Author

> That said, I think nodes should not vote on the empty hash.

What metrics do you expect to improve with this patch (not voting on the empty hash)? And why do you propose it here as a comment instead of posting a DIP with a better and more complete explanation?

Contributor

I wrote it here as part of the discussion. DIPs/Issues can always stem from comments :)
I'll write a proposal for this but, in short, I think voting on the empty hash yields the same issues as not collecting votes from previous iterations.
Either way, if we vote NIL when not receiving a block, we shouldn't vote on previous-iteration blocks, or we could produce a double (and opposite) vote on the same block.

Contributor

Created an Issue here

Contributor

As mentioned in the DIP Issue, relative to this PR: if the node votes NIL for a round/iteration after the timeout and then votes on the candidate when receiving it (vote_for_former_candidate), it produces two votes for the same block, creating potential indeterminism (nodes can choose either of the two votes).

Contributor Author
@goshawk-3 goshawk-3 Sep 29, 2023

Simply put:

  • A NIL vote means a vote for the tuple (empty hash, round, step).
  • A real vote means a vote for the tuple (candidate_block_hash, round, step).

> nodes can choose either of the two votes

Not the case at all. Nodes can reach quorum for an empty hash at iteration 1 and then move forward and reach quorum for the candidate block of iteration 1 while running the second iteration. The output of this case is a valid finalized block.

> creating potential indeterminism (nodes can choose either of the two votes).

There is no indeterminism, as these are not the same blocks. Also, voting NIL only helps provisioners move forward when a candidate block is not generated; it has no other impact.
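
For clarity, a small sketch of the two tuples described above (illustrative types only):

```rust
type Hash = [u8; 32];
const EMPTY_HASH: Hash = [0u8; 32];

/// A reduction vote as described above: a NIL vote is simply a vote whose
/// hash is the empty hash, bound to the same (round, step) a real vote uses.
#[derive(Debug, PartialEq, Eq)]
struct Vote {
    hash: Hash,
    round: u64,
    step: u8,
}

impl Vote {
    fn nil(round: u64, step: u8) -> Self {
        Self { hash: EMPTY_HASH, round, step }
    }
}

fn main() {
    let candidate: Hash = [7u8; 32];
    let nil = Vote::nil(10, 3);
    let real = Vote { hash: candidate, round: 10, step: 3 };
    // Same (round, step), different hashes: quorums on them are tallied
    // independently, which is the point being argued here.
    assert_ne!(nil, real);
}
```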

Contributor

We do not currently distinguish between a NIL for an unknown candidate and a NIL for an invalid block.
In both cases, the vote will be (round, step, empty_hash).
So, a NIL vote is effectively valid as a negative vote on the candidate, whichever that might be.

If we produce both a NIL and a candidate vote, they would be contradictory votes on the same block.
And if I receive both votes, I can use either for the quorum; both would be valid votes.
In fact, in extreme cases, there could be a quorum on both NIL and the candidate.

Example 1:

  • I'm at round/step
  • I don't receive the block, so I vote NIL (empty,round,step)
  • I then receive the block and vote (candidate,round,step)

Now there are two votes for the same round/step.
Let's say the committee is split in half (half voted NIL, half voted the candidate) and my vote is the one deciding the quorum.
Nodes receiving the NIL vote first will reach quorum on NIL and move to the next iteration;
nodes receiving the candidate vote first will reach quorum on the block and move to the next round.
Both groups will have a valid quorum for the same round and step, but one is on NIL and the other is on the candidate.

Example 2:

  • let's say all nodes are at round/step
  • the candidate is delayed, so all provisioners vote NIL
  • then the block is broadcasted and they all vote for it

Depending on which votes they receive first, nodes can either reach quorum on the candidate or on the empty hash.

> Not the case at all. Nodes can reach quorum for an empty hash at iteration 1 and then move forward and reach quorum for the candidate block of iteration 1 while running the second iteration. The output of this case is a valid finalized block.

You're only seeing the best-case scenario.
In this case you still have two valid quorums for the same round/iteration.
If a quorum is reached on iteration 2, it's possible that some nodes will move to the next round on the iteration-2 block while others will accept the first-iteration block and move to the next round.
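
A toy illustration of the split in Example 1, with a made-up committee size and quorum threshold (none of this reflects the node's actual parameters):

```rust
use std::collections::HashMap;

type Hash = [u8; 32];
const EMPTY_HASH: Hash = [0u8; 32];

/// Tallies votes for a single (round, step) in arrival order and returns the
/// first hash to reach the quorum threshold.
fn first_quorum(votes: &[Hash], threshold: usize) -> Option<Hash> {
    let mut tally: HashMap<Hash, usize> = HashMap::new();
    for &hash in votes {
        let count = tally.entry(hash).or_insert(0);
        *count += 1;
        if *count >= threshold {
            return Some(hash);
        }
    }
    None
}

fn main() {
    let candidate: Hash = [1u8; 32];
    // Five committee members, quorum of three: two voted NIL, two voted the
    // candidate, and the last member emitted *both* votes (NIL on timeout,
    // then the candidate once it finally arrived).
    let nil_arrives_first = [EMPTY_HASH, EMPTY_HASH, candidate, candidate, EMPTY_HASH, candidate];
    let candidate_arrives_first = [EMPTY_HASH, EMPTY_HASH, candidate, candidate, candidate, EMPTY_HASH];

    // Which quorum a node observes depends purely on delivery order.
    assert_eq!(first_quorum(&nil_arrives_first, 3), Some(EMPTY_HASH));
    assert_eq!(first_quorum(&candidate_arrives_first, 3), Some(candidate));
}
```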

Contributor Author

Reaching quorum on a tuple (empty_hash, round, step) does not affect reaching quorum for any candidate block of the same round.

I'd suggest we pair up (when you're available) and review the implementation together to make sure we're on the same page.

consensus/src/execution_ctx.rs (resolved conversation)
// the network for that broadcast may damage the
// fallback for nodes on different fork.

self.network.write().await.broadcast(msg).await;
Contributor

I don't think this is needed.
If consensus accepted the block, it means we have already broadcast it (or at least, we should have).
Also, if we already accepted the block from the network, we should not re-broadcast it, to avoid possible DoS attacks (e.g. flooding the network with duplicate blocks).
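
A sketch of the kind of duplicate guard this concern points at, assuming a simple set of already-seen block hashes (the guard and its names are hypothetical, not the node's code):

```rust
use std::collections::HashSet;

type Hash = [u8; 32];

/// Hypothetical guard: relay a block only the first time its hash is seen,
/// so duplicates received from the network are not flooded back out.
#[derive(Default)]
struct BroadcastGuard {
    seen: HashSet<Hash>,
}

impl BroadcastGuard {
    /// Returns true if the block has not been seen before and may be relayed.
    fn should_broadcast(&mut self, block_hash: Hash) -> bool {
        self.seen.insert(block_hash)
    }
}

fn main() {
    let mut guard = BroadcastGuard::default();
    let tip: Hash = [9u8; 32];
    assert!(guard.should_broadcast(tip));  // first sighting: relay it
    assert!(!guard.should_broadcast(tip)); // duplicate: drop it
}
```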

Contributor Author

I'm aware of that. This would also be the case when we implement fallback to any former (non-finalized) block, as we'd need to re-broadcast blocks with a lower tip.

The reasoning behind this change: any node that has already accepted a block at the 1st iteration will not re-propagate it, which effectively eclipses the network for other (non-consensus) nodes that are still on the wrong fork.

Contributor

I don't follow the reasoning...
If the block is known, whether because it was received from peers or produced by consensus, it would already have been propagated to the node's peers.
Re-broadcasting the same block when it is received from the network would be a duplicate.
If nodes accepted that block at the 1st iteration, they will have broadcast the same message they are receiving now.

Contributor Author

Noted. I'll remove it once I confirm that, in the case of a fallback, all nodes receive the correct block.

@goshawk-3 goshawk-3 mentioned this pull request Oct 2, 2023

goshawk-3 commented Oct 23, 2023

Closed in favor of PR: #1098

@goshawk-3 goshawk-3 closed this Oct 23, 2023
@goshawk-3 goshawk-3 deleted the collect_past_messages branch February 8, 2024 10:01