design: multi-replica scheduling for singleton sources (aka hot standby) #31205

aljoscha · 2025-01-27T17:47:52Z

Rendered: https://github.com/aljoscha/materialize/blob/design-multi-replica-singleton-sources/doc/developer/design/20250127_multi_replica_scheduling_singleton_sources.md

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

benesch

All makes sense to me!

benesch · 2025-01-27T21:42:03Z

doc/developer/design/20250127_multi_replica_scheduling_singleton_sources.md

+
+## Non-Goals
+
+- Add a failure detection mechanism for replicas and update source scheduling


👍🏽👍🏽👍🏽

I'm very in favor of explicitly excluding "fault tolerance" to keep the scope as limited as possible.

teskje

This seems reasonable to me. Stepping away from the assumption that all replicas receive the same commands will add a bunch of complexity in the controller, but it doesn't look like we have a choice. At least I don't think the alternative will turn out any simpler.

teskje · 2025-01-28T12:00:08Z

doc/developer/design/20250127_multi_replica_scheduling_singleton_sources.md

+We want to support zero-downtime ALTER on clusters. The plan for this is to
+turn on a new replica with the changed parameters and turn off the old replica
+when the new one is "sufficiently ready". This in turn requires that we are


Nit: This might be me being dumb, but at first I thought this was describing some form of ALTER SOURCE. It describes what we usually call "graceful cluster reconfiguration", so it might help to mention that term here.

adding a clarification!

teskje · 2025-01-28T12:08:36Z

doc/developer/design/20250127_multi_replica_scheduling_singleton_sources.md

+expected to shut down.
+
+A: This is caught by the mechanisms we already have today for making sure there
+is only one active ingestion dataflow. We need this for correctness in the fact


Suggested change

-is only one active ingestion dataflow. We need this for correctness in the fact
+is only one active ingestion dataflow. We need this for correctness in the face

design: multi-replica scheduling for singleton sources (aka hot standby)

6a126ef

aljoscha force-pushed the design-multi-replica-singleton-sources branch from 06a025f to 6a126ef Compare January 27, 2025 17:50

aljoscha changed the title ~~design: multi-replica scheduling for stateless sources (aka hot standby)~~ design: multi-replica scheduling for singleton sources (aka hot standby) Jan 27, 2025

benesch approved these changes Jan 27, 2025

teskje approved these changes Jan 28, 2025

aljoscha added 2 commits January 28, 2025 14:49

fixup! expand on what ALTER means

fc34409

fixup! add another Q/A section

7a666be

aljoscha force-pushed the design-multi-replica-singleton-sources branch from 20f48f8 to 7a666be Compare January 28, 2025 13:49
