M3DB Bootstrapping Flush Lifecycle fixes + Bootstrapping Out of Order Writes durability fixes
Problem
This proposal addresses the following problems:
Part 1. Bootstrapping generates large volumes of commit logs
During bootstrapping, nodes generate large volumes of commit logs that cannot be cleaned up, because we don't snapshot during bootstrap. This causes significant disk usage for no benefit: we don't currently read commit logs anyway if the node crashes and restarts while it is still initializing.
Commit logs can only be cleaned up after snapshots complete, but snapshots cannot complete during bootstrap because each individual snapshot must represent a complete block.
Part 2. Out of order writes will not get read by a joining node since commit log/snapshots never get read on bootstrap
For out of order writes received during a node join, the DB node relies on the fact that it accepts writes for shards it is bootstrapping from peers. This depends on the node holding the write in memory and flushing it to disk after a successful bootstrap; if the node fails before bootstrap completes, it will lose that write, since we do not read the commit logs when a shard is initializing (node joining).
Proposed Solution
Part 1. Refactor snapshots to be able to be written out during bootstrap (and for node to flush blocks to disk)
If snapshots could be combined when needed, then we would be able to snapshot during bootstraps, since we could reliably depend on a full set of data being captured by the snapshot on a subsequent bootstrap.
This could be done optimistically, such that "most of the time" the node can determine whether a snapshot holds a partial or full set of data for the block (i.e. work out whether the snapshots on disk were written out by the current process; if they were written by a previous instance of the process, they must be treated as partial and need to be combined). This would allow snapshots in the steady state to continue to avoid any merging when being written to disk.
This will fix problem part (1), and the node should no longer consume unnecessarily large amounts of disk when joining the cluster takes significant time.
Note: We cannot fulfill all of this until solution (3) is implemented, since it would cause collisions with the incremental peer bootstrapped block.
Part 2. Bootstrapping a joining node must include a normal bootstrap (i.e. read fs blocks, read commit logs, then peer stream)
This will ensure any out of order write remains durable across restarts.
This will fix problem part (2).
Part 3. Peer bootstrapping should merge data with any existing fileset files (if performing incremental peer bootstrap)
This is also required to fix problem part (2).
Part 4. Once solution (2) and (3) are done, writes to a leaving/joining pair node can be counted towards quorum
This greatly improves availability for writes and should also improve write latency during a node add.
Milestones
1. Combine snapshots on write if an existing snapshot on disk was not written out by the current process
This will ensure at least that, if snapshots and commit logs are used for bootstrapping, the snapshots will not contain only a partial set of the data accrued during the bootstrapping process.
This is a partial implementation of proposed solution part (1) and will address problem part (1).
2. Implement remaining proposed solutions
Implementing the remaining proposed solutions should fix remaining problem part (2).