design: multi-replica scheduling for singleton sources (aka hot standby) #31205
Open · aljoscha wants to merge 3 commits into MaterializeInc:main from aljoscha:design-multi-replica-singleton-sources

doc/developer/design/20250127_multi_replica_scheduling_singleton_sources.md (100 additions, 0 deletions)

# Multi-Replica Scheduling for Singleton Sources (aka hot-standby sources)

## Context

We want to support zero-downtime ALTER CLUSTER (aka graceful cluster
reconfiguration). The plan is to turn on a new replica with the changed
parameters and turn off the old replica once the new one is "sufficiently
ready". This in turn requires that we are able to run sources and sinks on
multi-replica clusters. This document concerns sources; we are punting on
sinks for now, though we might use a very similar solution once we have
addressed [storage: restarting a Sink should not require reading a snapshot
of the input #8603](https://github.com/MaterializeInc/database-issues/issues/8603).

For (UPSERT) Kafka sources we already have a solution in [Feedback
UPSERT](https://github.com/MaterializeInc/materialize/pull/29718), which
allows us to run Kafka sources concurrently. Singleton sources (Postgres and
MySQL as of writing) cannot currently run concurrently: they have state (a
slot or similar) at the upstream system that cannot be shared by concurrent
ingestions running on different replicas. The good thing about these kinds of
sources is that their running ingestion dataflow is virtually stateless, so
restarting them is fast. We rely on this latter fact in our approach to
running them on multi-replica clusters.

The approach we want to take can be summarized as: change the storage
controller to schedule singleton sources on only a single replica _and_
update that scheduling decision when the replica that is running a source is
going away. See below for why we think this is a feasible and sufficient
approach for a first implementation of this feature. As an aside, (UPSERT)
Kafka sources can be scheduled on all replicas of a cluster.

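To make the intended behavior concrete, here is a minimal sketch of such a
sticky scheduling decision. All names (`SingletonScheduler`, `schedule`, the
id types) are hypothetical and not the actual storage controller API; this
illustrates the idea, not the implementation.

```rust
use std::collections::BTreeMap;

// Hypothetical id types; the real controller has its own.
type SourceId = u64;
type ReplicaId = u64;

/// Sketch of per-source scheduling state for singleton sources.
struct SingletonScheduler {
    /// The replica each singleton source is currently assigned to.
    assignments: BTreeMap<SourceId, ReplicaId>,
}

impl SingletonScheduler {
    /// Pick a replica for a singleton source: stick with the current
    /// assignment while that replica is still alive, otherwise pick any
    /// live replica. Returns `None` if the cluster has no replicas.
    fn schedule(&mut self, source: SourceId, live: &[ReplicaId]) -> Option<ReplicaId> {
        match self.assignments.get(&source) {
            Some(r) if live.contains(r) => Some(*r),
            _ => {
                let r = *live.first()?;
                self.assignments.insert(source, r);
                Some(r)
            }
        }
    }
}
```

The important property is stickiness: the assignment only changes when the
assigned replica disappears, so a running ingestion is never moved
gratuitously.
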
## Goals

- Support running singleton sources on multi-replica clusters
- Support zero-downtime ALTER on clusters

## Non-Goals

- Add a failure-detection mechanism for replicas and update source scheduling
  based on that
- Changes to the storage protocol
- Changes to how storage handles things on the `clusterd` side, for example
  to make it handle concurrent singleton sources purely on the `clusterd`s
  without the controller being involved

## Rationale

Initially, the main goal of supporting singleton sources on multi-replica
clusters is to support zero-downtime ALTER. For this, the controller knows
which replicas are being created and which are being retired, so it is enough
to base scheduling decisions on that. If the goal were to enable HA for
sources running on multi-replica clusters, we would want to pursue a
different approach where we really do figure out a way for these types of
singleton sources to run concurrently, such that they can either a) very
quickly take over when another instance fails, or b) truly run concurrently
and write the output shard(s) collaboratively.

## Grabbag of Questions

Q: What happens when we accidentally try to schedule a concurrent ingestion
dataflow for a singleton source? Say, when the replica processes take longer
than expected to shut down.

A: This is caught by the mechanisms we already have today for making sure
there is only one active ingestion dataflow. We need this for correctness in
the face of upgrades, or failing/zombie `clusterd` pods. Today, source
dataflows use the state in the upstream system (the slot) to fence out other
readers.

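As a toy illustration of that fencing idea (purely a sketch; the real
mechanism is the upstream system's own exclusivity, e.g. a Postgres
replication slot admits only one reader at a time):

```rust
/// Toy model of an upstream resource that admits at most one reader.
struct ReplicationSlot {
    held_by: Option<u64>,
}

impl ReplicationSlot {
    /// An ingestion claims the slot exclusively; a concurrent ingestion
    /// (say, on a replica that is slow to shut down) errors out instead
    /// of silently double-reading.
    fn acquire(&mut self, ingestion: u64) -> Result<(), String> {
        match self.held_by {
            Some(holder) if holder != ingestion => {
                Err(format!("slot already held by ingestion {holder}"))
            }
            _ => {
                self.held_by = Some(ingestion);
                Ok(())
            }
        }
    }
}
```
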
Q: What happens when `envd`/the controller restarts and doesn't know which
replica a source is running on?

A: The new controller will tell one replica to run the ingestion, and
reconciliation will make it so that only that replica _is_ running the
source. If another replica was running the source, it would notice that it is
no longer supposed to run it and stop. As mentioned above, we are optimizing
for the case where there is only one replica running, with short periods
where we have two because of ALTER CLUSTER.

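A sketch of that reconciliation step from the replica's point of view
(hypothetical helper; the real mechanism is the existing storage command
reconciliation):

```rust
use std::collections::BTreeSet;

/// After a controller restart, a replica compares the ingestions it is
/// running against the set the new controller assigned to it: it stops
/// everything that is no longer assigned and starts whatever is missing.
fn reconcile(running: &mut BTreeSet<u64>, assigned: &BTreeSet<u64>) {
    // Stop ingestions this replica is no longer supposed to run.
    running.retain(|id| assigned.contains(id));
    // Start newly assigned ingestions (actual dataflow startup elided).
    for id in assigned {
        running.insert(*id);
    }
}
```
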
## Implementation

We're keeping this section very light for now, but we think the initial
implementation can be very bare-bones. Most (all?) changes need to happen in
[storage-controller/src/instance.rs](https://github.com/MaterializeInc/materialize/blob/2280405a2e1f8a44fa1c8f046a718c013ff7af6b/src/storage-controller/src/instance.rs#L73),
which is where we know which replicas exist and when they are being created
or destroyed.

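Continuing the hypothetical `SingletonScheduler` sketch from the Context
section, the hook would be roughly: when a replica is retired, re-assign its
singleton sources to one of the remaining replicas. Again, names and shapes
are illustrative only.

```rust
impl SingletonScheduler {
    /// When a replica goes away (e.g. the old replica during zero-downtime
    /// ALTER CLUSTER), move its singleton sources to a remaining replica
    /// and report which sources moved, so new commands can be issued.
    fn replica_retired(
        &mut self,
        retired: ReplicaId,
        remaining: &[ReplicaId],
    ) -> Vec<(SourceId, ReplicaId)> {
        let mut moved = Vec::new();
        for (source, replica) in self.assignments.iter_mut() {
            if *replica == retired {
                if let Some(new) = remaining.first() {
                    *replica = *new;
                    moved.push((*source, *new));
                }
            }
        }
        moved
    }
}
```
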
## Alternatives

### Handle Collaboration in the source dataflow implementation

We could change the source ingestion dataflows to handle concurrency
internally. They could do something akin to leader election and/or use the
state they have at the upstream system (the slot or similar) to figure out
who should be reading. They would need some sort of failure detection to
figure out when/if to take over from the "leader" ingestion, which comes with
its own collection of downsides.

## Open Questions

tbd!
---

Review comment:

👍🏽👍🏽👍🏽
I'm very in favor of explicitly excluding "fault tolerance" to keep the scope as limited as possible.