M3DB Bootstrapping Flush Lifecycle fixes + Bootstrapping Out of Order Writes durability fixes
Problem
This proposal addresses the following problems:
Part 1. Bootstrapping generates large volumes of commit logs
During bootstrapping, nodes generate large volumes of commit logs that cannot be cleaned up, because we don't snapshot during bootstrap. This causes significant disk usage for no benefit: we don't currently read commit logs anyway if the node crashes and restarts while it is still initializing.
Commit logs can only be cleaned up after snapshots complete, but snapshots cannot complete during bootstrap because each individual snapshot must represent a complete block.
Part 2. Out of order writes will not get read by a joining node since commit log/snapshots never get read on bootstrap
For out of order writes received during a node join, the DB node relies on the fact that it accepts writes for shards it is bootstrapping from peers. This depends on the node holding the write in memory and flushing it to disk after a successful bootstrap; if the node fails before bootstrap completes, it will lose that write, since we do not read the commit logs when a shard is initializing (node joining).
Proposed Solution
Part 1. Refactor snapshots to be able to be written out during bootstrap (and for node to flush blocks to disk)
If snapshots could be combined when needed, then we would be able to snapshot during bootstraps, since we could reliably depend on a full set of data being captured by the snapshot on a subsequent bootstrap.
This could be done optimistically, such that "most of the time" the node can determine whether a snapshot holds a partial or full set of data for the block (i.e. work out whether the snapshots on disk were written out by the current process; if they were written by a previous instance of the process, they must be treated as partial and need to be combined). This would allow snapshots in the steady state to continue to avoid any merging when being written to disk.
This will fix problem part (1), and the node should no longer consume unnecessarily large amounts of disk when joining the cluster takes significant time.
Note: We cannot fulfill all of this until solution (3) is implemented, since it would cause collisions with the incremental peer bootstrapped block.
Part 2. Bootstrapping a joining node must include a normal bootstrap (i.e. read fs blocks, read commit logs, then peer stream)
This will ensure any out of order write remains durable across restarts.
This will fix problem part (2).
Part 3. Peer bootstrapping should merge data with any existing fileset files (if performing incremental peer bootstrap)
This is also required to fix problem part (2).
Part 4. Once solution (2) and (3) are done, writes to a leaving/joining pair node can be counted towards quorum
This greatly improves availability for writes and should also improve write latency during a node add.
Milestones
1. Combine snapshots on write if an existing snapshot on disk was not written out by the current process
This will ensure at least that, if snapshots and commit logs are used for bootstrapping, the snapshots will not contain only a partial set of the data accrued during the bootstrapping process.
This is a partial implementation of proposed solution part (1) and will address problem part (1).
2. Implement remaining proposed solutions
Implementing the remaining proposed solutions should fix remaining problem part (2).