Proof of concept implementation for KafkaRoller 2.0 #11020
Draft
Description
This PR implements strimzi/proposals#103.
Nodes are categorised based on their observed states, and specific actions are performed for each category. Each action should cause a subsequent observation to result in a state transition.
When a new reconciliation starts up, a context object is created for each node to store the state and other useful information used by the roller.
Please start from RackRolling#loop, which contains the high-level logic.
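The state-to-category mapping described below can be sketched roughly as follows. The enum values follow the proposal, but the predicates here are illustrative assumptions, not the actual implementation:

```java
// Illustrative sketch only: the real categorisation lives in the roller's
// observation logic; the predicates here are inferred from this description.
class CategorizeSketch {

    enum State { NOT_READY, READY, NOT_RUNNING, RECOVERING }

    enum Category { WAIT_FOR_READINESS, RESTART_UNHEALTHY, WAIT_FOR_LOG_RECOVERY,
                    MAYBE_RECONFIGURE, RESTART, NOP }

    static Category categorize(State state, boolean responsive,
                               int numRestarts, boolean hasRestartReasons) {
        if (state == State.NOT_RUNNING || !responsive) {
            return Category.RESTART_UNHEALTHY;     // unhealthy nodes are handled first
        }
        if (state == State.RECOVERING) {
            return Category.WAIT_FOR_LOG_RECOVERY; // log recovery must finish
        }
        if (state == State.NOT_READY && numRestarts > 0) {
            return Category.WAIT_FOR_READINESS;    // already restarted, give it time
        }
        if (numRestarts > 0 && state == State.READY) {
            return Category.NOP;                   // restarted and ready: done
        }
        if (hasRestartReasons) {
            return Category.RESTART;               // pending restart reasons
        }
        if (state == State.READY) {
            return Category.MAYBE_RECONFIGURE;     // maybe a dynamic config change
        }
        return Category.NOP;                       // nothing left to do
    }
}
```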
High-level algorithm
Initialize Context for Each Node:
Contexts are recreated in each reconciliation with initial data. For example:
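As a hedged illustration of what such a context might hold (field names are inferred from the counters mentioned later in this description, not copied from the actual code):

```java
import java.util.List;

// Hypothetical per-node context; recreated at the start of each reconciliation.
record ContextSketch(
        int nodeId,
        String state,            // last observed node state, e.g. "READY"
        int numRestarts,         // restarts performed in this reconciliation
        int numReconfigs,        // reconfiguration attempts
        int numRetries,          // loop iterations retried for this node
        List<String> reasons) {  // collected restart reasons

    // A fresh context starts with no observed state and zeroed counters.
    static ContextSketch initial(int nodeId) {
        return new ContextSketch(nodeId, null, 0, 0, 0, List.of());
    }
}
```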
Transition Node States:
Update each node's state based on information from the abstracted sources: PlatformClient, RollClient and KafkaAgentClient. If the information cannot be retrieved, the current reconciliation fails immediately so that a new reconciliation is triggered.
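A minimal sketch of that fail-fast behaviour (the `Supplier` stands in for the three clients, which are abstractions in the actual code):

```java
import java.util.function.Supplier;

// Sketch: observe a node's state; any retrieval failure aborts the whole
// reconciliation so that the operator triggers a new one.
class ObserveSketch {
    static <S> S observeOrAbort(Supplier<S> source) {
        try {
            return source.get();
        } catch (RuntimeException e) {
            throw new IllegalStateException(
                    "Failed to observe node state; aborting reconciliation", e);
        }
    }
}
```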
Handle `NOT_READY` Nodes:
Wait for `NOT_READY` nodes to become `READY` within `operationTimeoutMs`. This gives a node the opportunity to become ready in case it has just been restarted. If the node is still not ready after the timeout, it falls through to the next step, which determines the action to take on it.
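The timed wait can be sketched as a simple polling loop (a deliberately naive stand-in, not the roller's actual waiting mechanism):

```java
import java.util.function.BooleanSupplier;

// Sketch: poll readiness until operationTimeoutMs elapses; on timeout the
// caller falls through to categorisation instead of failing outright.
class AwaitReadySketch {
    static boolean awaitReady(BooleanSupplier isReady, long operationTimeoutMs,
                              long pollIntervalMs) throws InterruptedException {
        long deadline = System.nanoTime() + operationTimeoutMs * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (isReady.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollIntervalMs);
        }
        return false;
    }
}
```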
Categorize Nodes:
Group nodes into actions based on their state and connectivity:

- `WAIT_FOR_READINESS`: nodes in `NOT_READY` state that have already been restarted (`numRestarts > 0`).
- `RESTART_UNHEALTHY`: nodes in `NOT_RUNNING` state or unresponsive to connections.
- `WAIT_FOR_LOG_RECOVERY`: nodes in `RECOVERING` state.
- `MAYBE_RECONFIGURE`: nodes in `READY` state with empty reason lists.
- `RESTART`: nodes with reasons for restart and no previous restarts.
- `NOP`: nodes with no reasons for restart and no reconfiguration needed, or nodes that have been restarted and are in `READY` state.

Grouping the nodes into these categories makes it clearer to take actions on them in a specific order. Also, the category and node state are not always 1:1; for example, nodes might be unresponsive despite having `READY` or `NOT_READY` state but need to be grouped together for sequential restarts. Grouping also makes it easier to batch broker nodes for parallel restart.

Wait for Log Recovery:
Wait for `WAIT_FOR_LOG_RECOVERY` nodes to become `READY` within `operationTimeoutMs`. If the timeout is reached and `numRetries` exceeds `maxRetries`, throw `MaxAttemptsExceededException`. Otherwise, increment `numRetries` and repeat from step 2. We do not wait for the broker to rejoin the ISR after it becomes ready, because the roller's responsibility is to restart the nodes safely, not to manage inter-broker replication. Additionally, we cannot guarantee that the broker will always be able to catch up within a reasonable time frame.

Restart `RESTART_UNHEALTHY` Nodes:
Restart nodes considering these special conditions:
- If multiple nodes are in `NOT_RUNNING` state, restart them in parallel so that they can form a quorum.
- If a node is in `NOT_RUNNING` state but does not have the `POD_HAS_OLD_REVISION` reason, the node is skipped.
- Wait for restarted nodes to become `READY` within `operationTimeoutMs`. If the timeout is reached, increment `numRetries` and repeat from step 2.

Refine `MAYBE_RECONFIGURE` Nodes:
Describe Kafka configurations via the Admin API and then regroup the nodes further:
- Nodes whose changed configurations can be applied dynamically are put into the `RECONFIGURE` group.
- Nodes whose configuration changes require a restart are put into the `RESTART` group.
- Nodes with no configuration changes are put into the `NOP` group.

Reconfigure Nodes:
Reconfigure nodes in the `RECONFIGURE` group:

- Check whether `numReconfigs` exceeds `maxReconfigAttempts`. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue.
- Send an `incrementalAlterConfig` request and increment `numReconfigAttempts`.
- Wait for each node to become `READY` within `operationTimeoutMs`. If the timeout is reached, repeat from step 2; otherwise continue. (We may want to add more specific checks for this, as the `READY` state may not be enough.)

Restart Nodes:
Nodes in the `RESTART` group will be restarted in the following order: pure controller nodes, combined nodes, the active controller, then broker nodes. Pure controller and combined nodes will be restarted one by one, and only when they don't have an impact on quorum health. The quorum health check logic is the same as the current roller's.
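The quorum health check can be illustrated roughly like this. The real check is based on quorum voter metadata (e.g. how recently each voter caught up with the leader); this sketch reduces that to a caught-up flag per voter:

```java
import java.util.Map;

// Sketch: a controller may restart only if a majority of quorum voters
// would still be caught up without it.
class QuorumCheckSketch {
    static boolean canRestart(int nodeId, Map<Integer, Boolean> voterCaughtUp) {
        long caughtUpWithoutNode = voterCaughtUp.entrySet().stream()
                .filter(e -> e.getKey() != nodeId && e.getValue())
                .count();
        int majority = voterCaughtUp.size() / 2 + 1;
        return caughtUpWithoutNode >= majority;
    }
}
```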
Combined nodes will be restarted only when they don't have an impact on availability as well as quorum health.
Broker nodes will be restarted one by one or in parallel, and only when they don't have an impact on availability.
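The availability check can be sketched in terms of ISRs (assuming, hypothetically, per-partition ISR sets and a `min.insync.replicas` value as inputs):

```java
import java.util.List;
import java.util.Set;

// Sketch: a broker may restart only if removing it from every ISR it belongs
// to still leaves at least min.insync.replicas members per partition.
class AvailabilitySketch {
    static boolean canRestartBroker(int brokerId,
                                    List<Set<Integer>> isrPerPartition,
                                    int minIsr) {
        for (Set<Integer> isr : isrPerPartition) {
            if (isr.contains(brokerId) && isr.size() - 1 < minIsr) {
                return false; // restart would drop this partition below min ISR
            }
        }
        return true;
    }
}
```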
Brokers will be restarted in parallel only when `maxRestartBatchSize` is greater than 1 and `dryRunBatchRolling` is set to false. If `dryRunBatchRolling` is set to true, the brokers will be restarted one by one, but the outcome of the batching algorithm based on `maxRestartBatchSize` will be logged.
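One plausible shape for the batching step. This greedy, partition-disjoint grouping is an assumption for illustration, not the proposal's exact algorithm:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: greedily build a batch of up to maxRestartBatchSize brokers that
// share no partitions, so a parallel restart removes at most one replica
// from any partition's ISR.
class BatchSketch {
    static List<Integer> nextBatch(List<Integer> brokers,
                                   Map<Integer, Set<String>> partitionsByBroker,
                                   int maxRestartBatchSize) {
        List<Integer> batch = new ArrayList<>();
        Set<String> coveredPartitions = new HashSet<>();
        for (int broker : brokers) {
            Set<String> parts = partitionsByBroker.getOrDefault(broker, Set.of());
            if (batch.size() < maxRestartBatchSize
                    && Collections.disjoint(parts, coveredPartitions)) {
                batch.add(broker);
                coveredPartitions.addAll(parts);
            }
        }
        return batch;
    }
}
```

With `dryRunBatchRolling` enabled, such a batch would only be logged while nodes are still restarted one by one.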
If nodes cannot be restarted because they do not meet the quorum health or availability checks, their `numAttempts` will be incremented. If any node exceeds `maxRetries`, throw `MaxAttemptsExceededException`; otherwise the process repeats from step 2. The process repeats until `MaxAttemptsExceededException` is thrown or all nodes are grouped into the `NOP` group.