M3DB Bootstrapping Flush Lifecycle fixes + Bootstrapping Out of Order Writes durability fixes #9

Open
robskillington opened this issue Apr 30, 2019 · 0 comments

robskillington commented Apr 30, 2019

Problem

This proposal addresses the following problems:

Part 1. Bootstrapping generates large volumes of commit logs

During bootstrapping, nodes generate large volumes of commit logs that cannot be cleaned up because we don't snapshot during bootstrap. This wastes significant disk space for no benefit, since we don't currently read commit logs anyway if a node that is still initializing crashes and restarts.

Commit logs can only be cleaned up after snapshots complete, but snapshots cannot be completed during bootstrap because each individual snapshot must represent a complete block.

Part 2. Out of order writes will not get read by a joining node since commit logs and snapshots never get read on bootstrap

For out of order writes written during a node join, the DB node relies on the fact that it takes writes for the shards it is bootstrapping from peers. Durability then depends on the node holding the write in memory and flushing it to disk after a successful bootstrap. If the node fails before the bootstrap completes, it will lose that write, since we do not read the commit logs when a shard is initializing (node joining).

Proposed Solution

Part 1. Refactor snapshots to be able to be written out during bootstrap (and for node to flush blocks to disk)

If snapshots could be combined when needed, we would be able to snapshot during bootstrap, since we could reliably depend on the snapshot capturing a full set of data for a subsequent bootstrap.

This could be done optimistically: "most of the time" the node can determine whether it holds a partial or a full set of data for the block by working out whether the snapshots on disk were written out by the current process. If they were written by a previous instance of the process, the data must be partial and the snapshots need to be combined. This would allow snapshots in the steady state to continue to avoid any merging when being written to disk.

This will fix problem part (1): the node should no longer use unnecessarily large amounts of disk when joining the cluster takes significant time.

Note: We cannot fulfill all of this until solution (3) is implemented, since it would cause collisions with the incremental peer bootstrapped block.
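
To make the optimistic check in Part 1 concrete, here is a minimal sketch in Go. It is not M3DB code: `SnapshotMeta`, `WriterID` and `needsMerge` are hypothetical names, and the sketch assumes each snapshot records an identifier for the process instance that wrote it.

```go
package main

import "fmt"

// SnapshotMeta is a hypothetical record describing a snapshot file on disk.
// The WriterID field is an assumption of this sketch: an identifier that is
// unique per process instance (for example, generated at startup).
type SnapshotMeta struct {
	BlockStartNanos int64
	WriterID        string
}

// needsMerge reports whether a new snapshot for the same block must be
// combined with the existing one before being written out.
//   - No existing snapshot: nothing to combine with.
//   - Existing snapshot written by the current process: the in-memory data is
//     a superset of it, so the new snapshot can replace it without merging.
//   - Existing snapshot written by a previous instance: the in-memory data may
//     be partial, so the two must be combined.
func needsMerge(existing *SnapshotMeta, currentWriterID string) bool {
	if existing == nil {
		return false
	}
	return existing.WriterID != currentWriterID
}

func main() {
	current := "instance-b"
	fromPreviousRun := &SnapshotMeta{BlockStartNanos: 0, WriterID: "instance-a"}
	fromThisRun := &SnapshotMeta{BlockStartNanos: 0, WriterID: current}

	fmt.Println("no snapshot on disk:", needsMerge(nil, current))
	fmt.Println("written by previous instance:", needsMerge(fromPreviousRun, current))
	fmt.Println("written by current instance:", needsMerge(fromThisRun, current))
}
```

The steady state only ever hits the last case, which is why it would keep avoiding merges.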

Part 2. Bootstrapping a joining node must include a normal bootstrap (i.e. read fs blocks, read commit logs, then peer stream)

This will ensure that any out of order write remains durable across restarts.

This will fix problem part (2).
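
As a rough illustration of the ordering in Part 2 (fs blocks, then commit logs, then peer stream), the sketch below runs three sources in sequence and accumulates what each recovers. The `source` type, `bootstrapShard` and `exampleSources` are hypothetical and not M3DB's bootstrapper API. Letting earlier sources win on a timestamp collision is an arbitrary choice made for the sketch.

```go
package main

import "fmt"

// source is a hypothetical bootstrap source. Given a shard it returns the
// datapoints (timestamp -> value) it can recover for that shard.
type source struct {
	name string
	read func(shard uint32) map[int64]float64
}

// bootstrapShard runs every source in the given order and accumulates the
// recovered data. Entries already present are not overwritten, so earlier
// sources take precedence (an assumption of this sketch).
func bootstrapShard(shard uint32, sources []source) (map[int64]float64, []string) {
	data := make(map[int64]float64)
	var order []string
	for _, s := range sources {
		order = append(order, s.name)
		for ts, v := range s.read(shard) {
			if _, ok := data[ts]; !ok {
				data[ts] = v
			}
		}
	}
	return data, order
}

// exampleSources builds made-up data in which the out of order write at
// timestamp 5 exists only in the commit log.
func exampleSources() []source {
	return []source{
		{name: "filesystem", read: func(uint32) map[int64]float64 {
			return map[int64]float64{10: 1.0}
		}},
		{name: "commitlog", read: func(uint32) map[int64]float64 {
			return map[int64]float64{5: 0.5, 10: 9.9}
		}},
		{name: "peers", read: func(uint32) map[int64]float64 {
			return map[int64]float64{20: 2.0}
		}},
	}
}

func main() {
	data, order := bootstrapShard(1, exampleSources())
	fmt.Println("order:", order)
	fmt.Println("recovered datapoints:", len(data))
	fmt.Println("out of order write at ts=5:", data[5])
}
```

If the commit log step were skipped, as it is today for an initializing shard, the write at timestamp 5 would not be recovered after a restart.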

Part 3. Peer bootstrapping should merge data with any existing fileset files (if performing incremental peer bootstrap)

This is also required to fix problem part (2).
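
The merge in Part 3 can be pictured as a two-way merge of datapoints sorted by timestamp. The sketch below is illustrative only: `Datapoint` and `mergeBlocks` are hypothetical names, both inputs are assumed to be sorted with unique timestamps, and preferring the local datapoint on a timestamp collision is an assumption rather than a decision made in this proposal.

```go
package main

import "fmt"

// Datapoint is a hypothetical, simplified representation of one sample.
type Datapoint struct {
	TimestampNanos int64
	Value          float64
}

// mergeBlocks merges datapoints already on disk (local) with datapoints
// streamed from a peer for the same block. Both inputs must be sorted by
// timestamp and contain no duplicate timestamps. On a collision the local
// datapoint is kept (an assumption of this sketch).
func mergeBlocks(local, peer []Datapoint) []Datapoint {
	merged := make([]Datapoint, 0, len(local)+len(peer))
	i, j := 0, 0
	for i < len(local) && j < len(peer) {
		switch {
		case local[i].TimestampNanos < peer[j].TimestampNanos:
			merged = append(merged, local[i])
			i++
		case local[i].TimestampNanos > peer[j].TimestampNanos:
			merged = append(merged, peer[j])
			j++
		default:
			merged = append(merged, local[i])
			i++
			j++
		}
	}
	// At most one of these tails is non-empty.
	merged = append(merged, local[i:]...)
	merged = append(merged, peer[j:]...)
	return merged
}

func main() {
	local := []Datapoint{{10, 1}, {30, 3}}
	peer := []Datapoint{{10, 9}, {20, 2}}
	for _, dp := range mergeBlocks(local, peer) {
		fmt.Println(dp.TimestampNanos, dp.Value)
	}
}
```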

Part 4. Once solutions (2) and (3) are done, writes to a leaving/joining node pair can be counted towards quorum

This greatly improves availability for writes and should also improve write latency during a node add.
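
One way to picture Part 4 is sketched below. None of the names come from M3DB. The sketch assumes that a leaving node and its joining replacement share a replica slot, and that the pair contributes one ack only when both nodes acknowledge the write. That is a conservative rule chosen for illustration, and the proposal does not specify it.

```go
package main

import "fmt"

// ShardState is a hypothetical, simplified shard state.
type ShardState int

const (
	Available ShardState = iota
	Leaving
	Initializing
)

// Ack is the result of a write to one node. Slot is a hypothetical replica
// slot index: a leaving node and its joining replacement share the same slot.
type Ack struct {
	Slot  int
	State ShardState
	OK    bool
}

// quorumReached counts successful acks against a majority quorum. When
// countPairs is true, a leaving/joining pair counts as one ack if both nodes
// of the pair acknowledged the write (an assumption of this sketch).
func quorumReached(acks []Ack, replicationFactor int, countPairs bool) bool {
	counted := 0
	leavingOK := make(map[int]bool)
	joiningOK := make(map[int]bool)
	for _, a := range acks {
		if !a.OK {
			continue
		}
		switch a.State {
		case Available:
			counted++
		case Leaving:
			leavingOK[a.Slot] = true
		case Initializing:
			joiningOK[a.Slot] = true
		}
	}
	if countPairs {
		for slot := range leavingOK {
			if joiningOK[slot] {
				counted++
			}
		}
	}
	return counted >= replicationFactor/2+1
}

func main() {
	// Replication factor 3: one healthy replica, one failed replica, and
	// one slot in the middle of a node replace.
	acks := []Ack{
		{Slot: 0, State: Available, OK: true},
		{Slot: 1, State: Available, OK: false},
		{Slot: 2, State: Leaving, OK: true},
		{Slot: 2, State: Initializing, OK: true},
	}
	fmt.Println("pairs not counted:", quorumReached(acks, 3, false))
	fmt.Println("pairs counted:", quorumReached(acks, 3, true))
}
```

In this made-up scenario the write only reaches quorum when the pair is counted, which is the availability gain described above.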

Milestones

1. Combine snapshots on write if a snapshot that was not written out by the current process already exists on disk

At a minimum, this ensures that when snapshots and commit logs are used for bootstrapping, the snapshots will not contain only the partial data accrued during the bootstrapping process.

This is a partial implementation of proposed solution part (1) and will address problem part (1).

2. Implement remaining proposed solutions

Implementing the remaining proposed solutions should fix the remaining problem part (2).

@robskillington robskillington changed the title M3DB Flush Lifecycle for Bootstrapping + Out of Order Writes M3DB Flush Lifecycle during Bootstrapping fixes + Out of Order Writes durability Apr 30, 2019
@robskillington robskillington changed the title M3DB Flush Lifecycle during Bootstrapping fixes + Out of Order Writes durability M3DB Bootstrapping Flush Lifecycle fixes + Bootstrapping Out of Order Writes durability fixes Apr 30, 2019