distributed sequencer algorithm #463
base: main
Conversation
Signed-off-by: John Hosie <[email protected]>
### Coordinator selection
This is a form of leader election but is achieved through deterministic calculation for a given point in time (block height), given predefined members of the group, and a common (majority) awareness of the current availability of each member. So there is no actual "election" term needed.
It's a pedantic words thing, but leader election is the outcome of the distributed algorithm, rather than the thing that is performed directly by Coordinator selection on each given node.
e.g. leader election is not performed using coordinator selection on a single node in isolation. Rather leader election is a/the property that emerges from every node following the full specification in this document, and one important component of that specification is a deterministic decision made independently by each node to come to the same conclusion about a set of data (including a block number that changes over time inconsistently across the nodes, and static configuration including the committee that is agreed deterministically by recording it to the blockchain at contract deployment time).
Each node is allocated positions on a hash ring. These positions are a function of the node name. Given that the entire committee agrees on the names of all nodes, all nodes independently arrive at the same decision about the positions of all nodes.
See note above about needing to be really specific about whether it is node name or committee member identifier.
For each Pente domain instance (smart contract API) the coordinator is selected by choosing one of the nodes in the configured privacy group of that contract.
- the selector is a pure function where the inputs are the node names + the current block number and the output is the name of one of the nodes
- the function will return the same output for a range of `n` consecutive blocks (where `n` is a configuration parameter of the policy)
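The pure-function selector described in these bullets can be sketched as follows. This is an illustrative sketch only: the node names, the range size `n`, and the use of SHA-256 are assumptions for the example, not the actual Paladin selector.

```python
import hashlib

def select_coordinator(node_names, block_number, n=5):
    """Pure function: identical inputs yield the identical coordinator on
    every node, with a stable answer for each range of n consecutive blocks."""
    range_index = block_number // n  # same value for n consecutive blocks

    def score(name):
        # Deterministic score derived from the range index and the node name
        digest = hashlib.sha256(f"{range_index}:{name}".encode()).digest()
        return int.from_bytes(digest, "big")

    # Highest score wins; sort first so the result is deterministic even in
    # the (practically impossible) event of a score tie
    return max(sorted(node_names), key=score)
```

Because `range_index` is constant across each run of `n` consecutive blocks, every committee member independently computes the same coordinator for the whole range.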
I don't think we have to solve for it in the first version of this specification, but I note that the ability to change `n` over time is something that might be necessary for tuning the algorithm on long-running privacy groups.
It probably fits into a wider note on "changing configuration".
When a node starts acting as the coordinator, it periodically sends a heartbeat message to all nodes for which it has an active delegation, and will accept delegation requests from any other node in the group.
All sender nodes keep track of the currently selected coordinator (either by reevaluating the deterministic function periodically or caching the result until block height exceeds the current range boundary). If they fail to receive a heartbeat message from the current coordinator, then they will choose the next closest node in the hashring (e.g. by reevaluating the hashring function with the unavailable nodes removed).
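The fallback described here can be sketched as a re-evaluation of the hash ring with the unavailable nodes filtered out. The hashing scheme, node names, and distance metric below are illustrative assumptions, not the spec's mandated scheme.

```python
import hashlib

def ring_position(name):
    """Position of a node on the hash ring, derived only from its name, so
    every committee member computes the same positions."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")

def choose_coordinator(node_names, preferred_point, unavailable=()):
    """Pick the node whose ring position is closest to `preferred_point`,
    re-evaluating with unavailable nodes removed."""
    candidates = [m for m in sorted(node_names) if m not in set(unavailable)]
    # Closest remaining position wins; when the preferred node is removed,
    # the answer falls through to the next closest node on the ring
    return min(candidates, key=lambda m: abs(ring_position(m) - preferred_point))
```

When the preferred coordinator misses heartbeats, a sender calls the same function again with that node listed as unavailable, deterministically arriving at the next closest node.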
What is the intended behavior of the algorithm in this scenario?
- Highest ranked coordinator is B
- D delegates transaction to B
- B stops for a period
  - Longer than the heartbeat interval
  - Shorter than it takes for the block-range to change
- A is the second-highest scored coordinator
- D delegates transaction to A (because it gives up on B)
- B becomes available again
- C delegates transaction to B
At this point D has delegated a TX to A, but C has delegated to B.
I haven't (yet at this point in the spec) understood what would cause A and B to notice that they are both active (because at the time A became the coordinator, B was not available to ask).
Ah, ok found this statement below:

> If Node A comes back online, then it will begin transmitting heartbeat messages again because it has selected itself as coordinator given the current block height.

This seems complex to implement - what would cause A (B in my above, sorry) to evaluate that it selected itself as the coordinator? It does not have any transactions after the restart. Why would this contract (of the infinite potential contracts) be one after restart that it would know to select itself as a coordinator for?
![image](../images/distributed_sequencer_switchover_frame1.svg){ width="500" }
As all nodes in the group (including node A) receive receipts of block 3, they will recognize node B as the new coordinator.
I did not understand this sentence.
This implies some future inference model and/or state tracking that seems very complex to achieve, and it's unclear to me the justification for this.
I note I believe this different statement is true, but it's currently the only thing I understand to be true with respect to evaluation of block height:

> When each node in the group is notified that the block height of the chain has reached 4 (happens at different times on each node) they will recognize node B as the highest scoring coordinator when they have reason to run the scoring algorithm
Ah, ok - I think I understand.
You are saying that the combination of these two things will cause the scoring algorithm to occur:
- Detection of block height moving to 4
- The fact that transactions are unresolved for this contract address

This implies the algorithm re-evaluates the local "sender state" on detection of each block.
However, I'm worried I'm missing some detail as you say:

> receive receipts of block 3

I'm not clear what receiving receipts means in this case, how it relates to block confirmations, or why it's block 3 (rather than 4) that's relevant here.
The block numbers in the diagram start at 0, therefore block 3 is the last block in the first range. Early versions of these diagrams had `x+0`, `x+1` etc. where `x` was the block height when the privacy group was created, but I felt that the `x+..` was adding noise to the diagrams with no added value, so I simplified it to an absolute numbering and it happened to start at 0.
So when we see that block 3 has been mined, we can assume that any further transactions should come into the second range. Given that block 0 is the genesis block, it is probably not very helpful for me to be including it in the first range so I'll change the numbering system to start at 1.
The indexing wasn't the problem. Starting from zero was fine for me - we were referring to the same block.
I think you just spelt "Detection of block height moving to 4" as "receive receipts of block 3".
If instead you had said "confirmation of block 3 being mined, so the next block is now block 4" I'd have been all square.
The final transmission from node A, as coordinator, is a message to node B with details of the speculative domain context predicting confirmation of transactions `A-2`, `B-2`, `C-2` and `D-2`. This includes the state IDs for any new states created by those transactions and any states that are spent by those transactions. This message also includes the transaction hash for the final transaction that is submitted on the sequence coordinated by node A. Node B needs to receive this message so that it can proceed with maximum efficiency. Therefore this message is sent with an assured delivery quality of service (similar to state distribution).
Meanwhile, all nodes delegate transactions `A-3`, `B-3`, `C-3` and `D-3` to node B.
I'm struggling to understand the assumptions for this "meanwhile".
What makes it safe to send anything to B, if you are not absolutely certain that A has transferred your previous transactions to B?
The message from C->B that is sent by C because `DelegationReturned` was received by C could (in fact in the above sequence is reasonably likely to) reach B before the message from A->B that provides information about B's previous transactions.
I wonder, given these facts:
- `DelegationReturned` will be sent by A->C after an "I've taken ownership of submission" message (sorry couldn't see the name) for `C-2` is sent from A->C
- The sender is responsible for their transactions
- The sender thus will know at the point it sends C-3 to B, that C-2 has reached the point of confirmed delegation on A

... whether C should just package up the information about which confirmed delegations it has on A, up to B.
TODO

- need more detail on precisely _how_ nodes B, C and D know that transactions `A-2`, `B-2`, `C-2` and `D-2` are past the point of no return and that transactions `A-3`, `B-3`, `C-3` and `D-3` do need to be re-delegated.
- need more detail on precisely _how_ node B can continue to coordinate the assembly of transactions `A-3`, `B-3`, `C-3` and `D-3` in a domain context that is aware of the speculative state locks from transactions `A-2`, `B-2`, `C-2` and `D-2`
If we start to use `domain context` as a specification level thing (vs. just an implementation detail to meet the requirements of the spec) we will need to define what it is and how it works.
- It might be more productive for the next level of detail on these points to come in the form of a proposal in code.
Some code to experiment with behavior is great... we will need to come back and formally address the behavior of the specification afterwards.
- If the sender node is ahead, it continues to retry the delegation until the delegate node finally catches up and accepts the delegation
- If the sender node is behind, it waits until its block indexer catches up and then selects the coordinator for the new range
- Coordinator node will continue to coordinate (send endorsement requests and submit endorsed transactions to base ledger) until its block indexer has reached a block number that causes the coordinator selector to select a different node.
  - at that time, it waits until all dispatched transactions are confirmed on chain, then delegates all current inflight transactions to the new coordinator.
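The coordinator-side handover rules in these bullets can be reduced to a small decision function. This is a minimal sketch under stated assumptions: the function name, return values, and state arguments are hypothetical, not Paladin API.

```python
def coordinator_action(self_name, selected_name, dispatched_unconfirmed, inflight):
    """Decide what the currently-active coordinator should do when the
    selector output may have changed with block height."""
    if selected_name == self_name:
        return "continue-coordinating"
    if dispatched_unconfirmed:
        # Already-dispatched transactions must confirm on chain first
        return "wait-for-confirmations"
    if inflight:
        # Then the remaining inflight transactions move to the new coordinator
        return ("delegate-inflight-to", selected_name)
    return "stand-down"
```

The key ordering this captures is that delegation of inflight work only happens after every dispatched transaction has been confirmed on chain.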
Some nesting of sub-bullets would help readability here
Also, see comments above on my questions around "delegates all current inflight transactions to the new coordinator" being between coordinators. e.g. in the algorithm as described sometimes it's the sender that delegates, and sometimes it's another coordinator, and I'm not sure we've worked through that fully yet (or why we've chosen that approach vs. it's always the sender that delegates).
- if the new coordinator is not up to date with block indexing, then it will reject and the delegation will be retried until it catches up.
- while a node is the currently selected coordinator, it sends endorsement requests to every other node for every transaction that it is coordinating
This seems like an important part of the algorithm for sure, but I'm not sure why the spec for it is under "variation in block height".
Ok - understood now with the below bullet:

> if not, then it rejects the endorsement and includes its view of the current block number in the rejection message

Maybe I'm wondering if this fact should have been mentioned earlier.
I did a search and endorsement is mentioned a lot earlier as something that comes out of the spec, but not really in these concrete terms.
#### Simple happy path
*(`mermaid` sequence diagram, content not shown in this excerpt)*
Very cool 🚀
### Sender's responsibility
The sender node for any given transaction remains ultimately responsible for ensuring that transaction is successfully confirmed on chain or finalized as failed if it is not possible to complete the processing for any reason. While the coordination of assembly and endorsement is delegated to another node, the sender continues to monitor the progress and is responsible for initiating retries or re-delegation to other coordinator nodes as appropriate.
> or re-delegation

Again we have the inconsistency here. This paragraph states that re-delegation is the sender's responsibility, but as discussed in comments above we seem to have split this responsibility between sender and coordinator.
As I've mentioned above, I think it would be a simpler-to-understand spec if it was always the sender's responsibility as this paragraph indicates.
Feedback available to the sender node that can be monitored to track the progress or otherwise of the transaction submission:
- when the sender node is choosing the coordinator, it may have recently received a heartbeat message from the preferred coordinator or an alternative coordinator
- when sending the delegation request to the coordinator, the sender node expects to receive an acknowledgement that the request has been received. This is not a guarantee that the transaction will be completed. At this point, the coordinator has only an in-memory record of that delegated transaction
I'm not clear so far whether, if a node has A-1 and A-2, it is responsible for waiting for the ack of A-1 being delegated before delegating A-2. This is important in edge cases such as failover, as discussed above.
Yes. I need to add some details generally to explain how explicit dependencies are handled.
- coordinator heartbeat messages. The payload of these messages contains a list of transaction IDs that the coordinator is actively coordinating
- transaction confirmation request. Once the coordinator has fulfilled the attestation plan, it sends a message to the transaction sender requesting permission to dispatch. If, for any reason, the sender has already re-delegated to another coordinator, then it will reject this request; otherwise, it will accept.
This seems to answer some of my earlier questions, but it also seems inconsistent again. This implies it's the sender that re-delegates (whereas the above diagrams implied some direct old-coordinator <-> new-coordinator comms)
Decisions and actions that need to be taken by the sender node:
- When a user sends a transaction intent (`ptx_sendTransaction` or `ptx_prepareTransaction`), the sender node needs to choose which coordinator to delegate to.
A detail point - but no transaction can be sent to a coordinator, until any referenced dependencies have been confirmed.
- If the coordinator node seems to have forgotten about the transaction, then the sender node needs to decide to re-delegate it
- If the preferred coordinator node becomes unavailable then the sender node needs to decide which alternative coordinator to delegate to
I don't understand the difference between detection of "becomes unavailable" and "seems to have forgotten about it". I need a bit of help to understand.
I know in the spec, when a coordinator has a transaction it is responsible for regularly sending an "I'm still coordinating transaction X for contract Y" (referred to as "heartbeat") message. So one clear thing a sender can do is detect that it hasn't received an expected "heartbeat". But in that situation how would it know the difference between the coordinator having restarted and forgotten, vs. being unavailable?
- If the block height changes and there is a new preferred coordinator as per the selection algorithm then the sender node needs to decide whether to delegate the transaction to it. This will be dependent on whether the transaction has been dispatched or not.
- If a transaction has been delegated to an alternative coordinator and the preferred coordinator becomes available again, then the sender needs to decide to re-delegate to the preferred coordinator
I covered this exact scenario in a detail question earlier. It was unclear to me what would cause a sender to find out "the preferred coordinator becomes available again".
```proto
google.protobuf.Timestamp timestamp = 2;
string idempotency_key = 3;
string contract_address = 4;
repeated string transaction_ids = 5;
```
This is great. I assume this is only the transaction IDs that have been delegated by the sender this heartbeat is targeted to?
I am proposing that it may include other transactions and those should be ignored by the receiver of the message.
##### <a name="message-transaction-delegation-rejected"></a>Transaction delegation rejected
The handling of a delegation rejected message depends on the reason for rejection.
- if the reason is `MismatchedBlockHeight` and the target delegate is ahead then a new delegation request is sent once the sender has reached a compatible block height. Compatible block height is defined as a block height in the same block range.
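The "compatible block height" definition reduces to a same-range check. A sketch, assuming ranges of `n` consecutive blocks as described earlier (the value of `n` here is illustrative, not the configured policy value):

```python
def block_range(height, n=5):
    # Every run of n consecutive blocks maps to one range index
    return height // n

def compatible(sender_height, delegate_height, n=5):
    # Compatible = both heights fall in the same block range
    return block_range(sender_height, n) == block_range(delegate_height, n)
```

So two nodes can be at different heights and still be "compatible", as long as neither has crossed into the next range.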
Ok - think this is confirmation of my above point - the algorithm above is working to "is compatible block height?" (not "same") semantics.
##### <a name="message-transaction-delegation-accepted"></a>Transaction delegation accepted
If a node receives a `DelegationAccepted` message then it should start to monitor the continued acceptance of that delegation. It can expect to receive `HeartbeatNotification` messages from the delegate node and for those messages to include the id of the delegated transaction. The `sender` node cannot assume that the `coordinator` node will persist the delegation request. If the heartbeat messages stop or if the received heartbeat messages do not contain the expected transaction ids, then the sender should retrigger the `HandleTransaction` process to cause the transaction to be re-delegated either to the same delegate, a new delegate, or to be coordinated by the sender node itself, whichever is appropriate for the current point in time.
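The sender-side monitoring described here can be sketched as a small watchdog: refresh a timer only when a heartbeat actually mentions our transaction id, and flag re-delegation when it goes quiet. Class, method names, and the timeout value are assumptions for illustration.

```python
class DelegationMonitor:
    def __init__(self, tx_id, timeout_seconds=15.0, now=0.0):
        self.tx_id = tx_id
        self.timeout = timeout_seconds
        self.last_seen = now  # time we last saw our tx id in a heartbeat

    def on_heartbeat(self, tx_ids, now):
        # Heartbeats that omit our transaction id do NOT refresh the timer:
        # that absence is exactly the "coordinator forgot" signal
        if self.tx_id in tx_ids:
            self.last_seen = now

    def should_redelegate(self, now):
        # True once the coordinator has gone quiet (or forgetful) for longer
        # than the allowed window
        return (now - self.last_seen) > self.timeout
```

Note this treats "heartbeats stopped" and "heartbeats no longer mention my transaction" identically, which matches the paragraph above: either way the sender retriggers `HandleTransaction`.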
Digressing to an implementation point here... how do we think code could reasonably be constructed for this?
It seems like a sender node will have:
- A database with zero-or-more (maybe 1 million) transactions it wishes to get submitted
- Each transaction will be on the same or different contract address (maybe 1 million contract addresses)
- Each contract address has a different coordinator committee
- Each transaction might be in one of a number of states (a state machine diagram would be awesome for this)
  - Initial state - candidate to be delegated to someone
  - Delegated to self - I'm the coordinator so I don't need to healthcheck myself
  - Delegated to remote, but not yet requested submission
  - Delegated to remote, submission requested, but submission not occurred
  - Delegated to remote, submission requested, submission approved, but not yet confirmed
  - Delegated to remote, submission confirmed
  - While in any of these states, we might detect that the block range changing means this is the "wrong" coordinator
  - Done. No need to continue tracking
- The decision on whether an action should be performed for any transaction in this DB could be triggered by:
  - A message from the active coordinator for that txn (we have specs for these starting to finalize in this doc)
  - The lack of a message from the active coordinator for that txn (the subject of this section)
  - The arrival of a new confirmed block
  - The arrival of an on-chain event (from a confirmed block) confirming a transaction
  - The completion of an earlier transaction that directly (through dependency) or indirectly (through memory-management of "in-flight" transactions for that contract) unlocks processing
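One way the state list above could be encoded, with the legal transitions made explicit. This is a hypothetical sketch of the implied state machine, not the Paladin implementation; the state names and allowed-transition table are assumptions.

```python
from enum import Enum, auto

class TxState(Enum):
    PENDING = auto()               # candidate to be delegated to someone
    DELEGATED_SELF = auto()        # this node is the coordinator
    DELEGATED_REMOTE = auto()      # delegated, submission not yet requested
    SUBMISSION_REQUESTED = auto()  # coordinator asked permission to dispatch
    SUBMISSION_APPROVED = auto()   # approved but not yet confirmed on chain
    CONFIRMED = auto()             # done; no need to continue tracking

# A block-range change can send any undispatched transaction back to PENDING
ALLOWED = {
    TxState.PENDING: {TxState.DELEGATED_SELF, TxState.DELEGATED_REMOTE},
    TxState.DELEGATED_SELF: {TxState.SUBMISSION_APPROVED, TxState.PENDING},
    TxState.DELEGATED_REMOTE: {TxState.SUBMISSION_REQUESTED, TxState.PENDING},
    TxState.SUBMISSION_REQUESTED: {TxState.SUBMISSION_APPROVED, TxState.PENDING},
    TxState.SUBMISSION_APPROVED: {TxState.CONFIRMED},
    TxState.CONFIRMED: set(),
}

def transition(state, new_state):
    """Apply a transition, rejecting anything the table does not allow."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Pinning down a table like this would also make the triggers listed above (heartbeat, missing heartbeat, new block, on-chain event, dependency completion) easy to express as events that attempt particular transitions.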
In our code in Paladin we've chosen to have evaluation loops against the DB and in-memory state, working at various levels. I guess I'm writing this question because I'm not 100% sure how to correlate those code modules to the state machine(s) implied by this specification, such that I can see how it fully scales to "mempool management" of a large sea of transactions with efficient DB query and in-memory management.
Potentially this whole question should be moved out of this issue on refinement of a formal specification, into a separate one that is about updating our code to be a complete high-performance implementation of the spec.
Going to pause my review here @hosie because I think there's a lot of outstanding conversations in what I've done so far that is affecting my ability to understand some of the more detailed items like the flow charts.
Thanks @peterbroadhurst these are great points. I feel bad that you had to trudge through some basic inconsistencies in the writing resulting from in-flight course adjustment, but hopefully that does eventually help to share the journey in the thinking rather than just the final destination. I'll focus on fixing the inconsistencies first so that the current draft is at least a coherent strawman within itself, and then we can pick up the following key conversation points to decide what significant changes are needed:

- **Identity Vs Node as committee members.** I think this is a short conversation. I don't have any strong objections to moving completely to …
- **How much responsibility is actually delegated.** At one point I was of the thinking that when a … A consequence of this decision is that it creates a possibility where the … Another consequence of this decision is that it is now the responsibility of the sender to detect when the preferred coordinator for the current range comes back online. That hasn't been worked through in the write-up yet and I think we need a better strawman before we can have a useful convo on this.
- **Message ID / Correl ID etc**
- **Feasibility of efficient, scalable, implementation**
- **Hot reconfiguration**
Co-authored-by: Peter Broadhurst <[email protected]> Signed-off-by: John Hosie <[email protected]>
This PR is stale because it has been open 30 days with no activity.
This PR introduces the detail documentation of the distributed sequencer protocol.
In domains (such as Pente) where the spending rules for states allow any one of a group of parties to spend the state, we need to coordinate the assembly of transactions across multiple nodes so that we can maximize throughput by speculatively spending new states and avoid transactions being reverted due to concurrent double-spending / state contention. The distributed sequencer protocol described in this PR is the formal specification of how that operates and the responsibilities and expectations of each node involved.
This goes beyond describing what happens to be implemented in code at this moment in time and aims to be comprehensive in the specification of the algorithm. Further code-change PRs will be opened to bring the code in line with this architecture once it has been agreed.
Includes an explicit (but concise) write-up of the key architectural decisions made and the consequences of those compared to the alternatives.
Remaining TODOs before moving out of draft:

Tweaks to protocol
- `dispatchNotification` protocol point: replace with heartbeat messages from submitter
- `dispatchConfirmation` has little value being a persistence point on the sender node, so update the diagrams to not assume persistence here
- `DelegationReturned` message: possibly include it in a "potential future optimizations" section. The protocol should not rely on this.

Fill in gaps / corrections in writeup
- `TransactionDelegation` and `Endorsement` message exchanges
- `REQUEST DISPATCH CONFIRMATION`
- `Check availability` sub process
- `Endorsement request`: add flowchart and include where the startup message fits in
- `Gather endorsement`: add flowchart
- `Select available coordinator`: fix diagram and complete description
- `message-exchange-transaction-confirmation`

Formatting