Proof of concept implementation for KafkaRoller 2.0 #11020

Draft
tinaselenge wants to merge 1 commit into base: main

Conversation

@tinaselenge (Contributor) commented Jan 8, 2025

Type of change

Select the type of your PR

  • Enhancement / new feature

Description

This PR implements strimzi/proposals#103.

Nodes are categorised based on their observed states, and specific actions are performed for each category. Those actions should cause a state transition at the next observation.

When a new reconciliation starts up, a context object is created for each node to store the state and other useful information used by the roller.
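For illustration, here is a minimal Java sketch of what such a per-node context could look like. The field names follow step 1 of the algorithm below; the concrete types, the representation of `reason` and the `transitionTo()` helper are assumptions made for the sketch, not the actual PR code.

```java
import java.util.Set;

// Hypothetical sketch of the per-node context described in step 1 below; not the actual PR code.
// The states named here are the ones referred to in this description.
enum State { UNKNOWN, NOT_READY, READY, NOT_RUNNING, RECOVERING }

class Context {
    final NodeRef nodeRef;           // NodeRef passed in from KafkaReconciler (existing Strimzi type, import omitted here)
    final Set<String> currentRoles;  // derived from the strimzi.io/controller-role and strimzi.io/broker-role pod labels
    final Set<String> reason;        // result of the restart predicate from KafkaReconciler
    State state = State.UNKNOWN;
    long lastTransition = System.currentTimeMillis();
    int numRestarts = 0;
    int numReconfigs = 0;
    int numAttempts = 0;

    Context(NodeRef nodeRef, Set<String> currentRoles, Set<String> reason) {
        this.nodeRef = nodeRef;
        this.currentRoles = currentRoles;
        this.reason = reason;
    }

    // Record a state change together with the time at which it was observed.
    void transitionTo(State newState) {
        if (this.state != newState) {
            this.state = newState;
            this.lastTransition = System.currentTimeMillis();
        }
    }
}
```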

Please start from RackRolling#loop, which contains the high-level logic.
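As a reading aid, the overall shape of that loop, as described in the steps below, might be sketched like this. Only RackRolling#loop and the category names from step 4 come from this PR; the stub methods, fields and values are invented for the sketch, and the Context type is the one sketched above.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Illustrative categories taken from step 4 of the description.
enum Action { WAIT_FOR_READINESS, RESTART_UNHEALTHY, WAIT_FOR_LOG_RECOVERY, MAYBE_RECONFIGURE, RESTART, NOP }

// Hypothetical skeleton of the reconciliation loop; the per-step methods are empty stubs
// standing in for the behaviour described in the numbered steps below.
class RackRollingSketch {
    private final long operationTimeoutMs = 300_000;   // illustrative value
    private final int maxRetries = 3;                   // illustrative value

    void loop(List<Context> contexts) {
        for (int attempt = 0; ; attempt++) {
            observeAndTransitionStates(contexts);                       // step 2
            waitForNotReadyNodes(contexts, operationTimeoutMs);         // step 3
            Map<Action, List<Context>> groups = categorize(contexts);   // step 4
            if (groups.getOrDefault(Action.NOP, List.of()).size() == contexts.size()) {
                return;                                                 // step 10: every node is NOP, nothing left to do
            }
            if (attempt >= maxRetries) {
                // In the real design this is tracked per node via numAttempts; simplified here.
                throw new IllegalStateException("Max attempts exceeded");
            }
            waitForLogRecovery(groups.get(Action.WAIT_FOR_LOG_RECOVERY));   // step 5
            restartUnhealthy(groups.get(Action.RESTART_UNHEALTHY));         // step 6
            refineAndReconfigure(groups.get(Action.MAYBE_RECONFIGURE));     // steps 7 and 8
            restartSafely(groups.get(Action.RESTART));                      // step 9
        }
    }

    // Empty stubs standing in for the logic described in the steps.
    private void observeAndTransitionStates(List<Context> contexts) { }
    private void waitForNotReadyNodes(List<Context> contexts, long timeoutMs) { }
    private Map<Action, List<Context>> categorize(List<Context> contexts) { return new EnumMap<>(Action.class); }
    private void waitForLogRecovery(List<Context> nodes) { }
    private void restartUnhealthy(List<Context> nodes) { }
    private void refineAndReconfigure(List<Context> nodes) { }
    private void restartSafely(List<Context> nodes) { }
}
```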

High-level algorithm

  1. Initialize Context for Each Node:
    Contexts are recreated in each reconciliation with initial data. For example:

    Context: {
        nodeRef: <NodeRef from KafkaReconciler>,
        currentRoles: <Set using pod labels `strimzi.io/controller-role` and `strimzi.io/broker-role`>,
        state: UNKNOWN,
        lastTransition: <SYSTEM_TIME>,
        reason: <Result of predicate function from KafkaReconciler>,
        numRestarts: 0,
        numReconfigs: 0,
        numAttempts: 0
    }
    
  2. Transition Node States:
    Update each node's state based on information from the abstracted sources: PlatformClient, RollClient and KafkaAgentClient. If the information cannot be retrieved, the current reconciliation fails immediately so that a new reconciliation is triggered.

  3. Handle NOT_READY Nodes:
    Wait for NOT_READY nodes to become READY within operationTimeoutMs.

    This gives a node that has just been restarted an opportunity to become ready. If the node is still not ready after the timeout, it falls through to the next step, which determines the action to take on it.

  4. Categorize Nodes:
    Group nodes, based on their state and connectivity, into the following action categories:

    • WAIT_FOR_READINESS: Nodes in NOT_READY state that have already been restarted (numRestarts > 0).
    • RESTART_UNHEALTHY: Nodes in NOT_RUNNING state or unresponsive to connections.
    • WAIT_FOR_LOG_RECOVERY: Nodes in RECOVERING state.
    • MAYBE_RECONFIGURE: Nodes in READY state with empty reason lists.
    • RESTART: Nodes with reasons for restart, and no previous restarts.
    • NOP: Nodes with no reason to restart and no reconfiguration needed, or nodes that have already been restarted and are in READY state.

    Grouping the nodes into these categories makes it clearer to act on them in a specific order. The category and node state are not always 1:1; for example, nodes might be unresponsive despite being in READY or NOT_READY state, but still need to be grouped together for sequential restarts. Grouping also makes it easier to batch broker nodes for parallel restart. A rough sketch of this grouping is included after the algorithm steps below.

  5. Wait for Log Recovery:
    Wait for WAIT_FOR_LOG_RECOVERY nodes to become READY within operationTimeoutMs. If the timeout is reached and numAttempts exceeds maxRetries, throw MaxAttemptsExceededException. Otherwise, increment numAttempts and repeat from step 2. We do not wait for the broker to rejoin the ISR after it becomes ready, because the roller's responsibility is to restart the nodes safely, not to manage inter-broker replication. Additionally, we cannot guarantee that the broker will always be able to catch up within a reasonable time frame.

  6. Restart RESTART_UNHEALTHY Nodes:
    Restart nodes considering special conditions:

    • If multiple controller nodes are NOT_RUNNING, restart them in parallel to form a quorum.

      This is to address the issue described in [KRaft] Mixed nodes cluster is not able to recover from misconfiguration affecting the whole cluster #9426.

    • If a node is in NOT_RUNNING state but does not have the POD_HAS_OLD_REVISION restart reason, it will be skipped.
    • Restart nodes one by one in the following order: pure controller, combined, then broker.
    • Wait for each restarted node's state to transition to READY within operationTimeoutMs. If the timeout is reached, increment numAttempts and repeat from step 2.
  7. Refine MAYBE_RECONFIGURE Nodes:
    Describe Kafka configurations via Admin API and then regroup the nodes further:

    • Nodes with dynamic config changes are added to the RECONFIGURE group.
    • Nodes with non-dynamic config changes are added to the RESTART group.
    • Nodes with no config changes are added to the NOP group.
  8. Reconfigure Nodes:
    Reconfigure nodes in the RECONFIGURE group:

    • Check if numReconfigs exceeds maxReconfigAttempts. If exceeded, add a restart reason and repeat from step 2. Otherwise, continue.
    • Send an incrementalAlterConfigs request and increment numReconfigs.
    • Wait for each node's state to transition to READY within operationTimeoutMs. If the timeout is reached, repeat from step 2; otherwise continue. (We may want to add more specific checks for this, as the READY state may not be enough.)
  9. Restart Nodes:
    Nodes in the RESTART group will be restarted in the following order: pure controllers, combined nodes, the active controller, then broker nodes.
    Pure controller and combined nodes will be restarted one by one, and only when doing so does not impact quorum health. The quorum health-check logic is the same as the current roller's.
    Combined nodes will be restarted only when they impact neither availability nor quorum health.
    Broker nodes will be restarted one by one or in parallel, and only when they do not impact availability.
    Brokers will be restarted in parallel only when maxRestartBatchSize is greater than 1 and dryRunBatchRolling is set to false. If dryRunBatchRolling is set to true, the brokers will be restarted one by one, but the outcome of the batching algorithm based on maxRestartBatchSize will be logged.

If nodes cannot be restarted because the quorum health or availability check is not met, their numAttempts will be incremented. If any node exceeds maxRetries, MaxAttemptsExceededException is thrown; otherwise the process repeats from step 2.

  10. Repeat Reconciliation:
    The process repeats until MaxAttemptsExceededException is thrown or all nodes are grouped into the NOP group.
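As mentioned in step 4, here is a rough sketch of how a node's context could be mapped to one of the categories, reusing the Context, State and Action types from the sketches above. The branch ordering and the `responsive` flag standing in for the connectivity check are assumptions; the actual grouping logic in the PR may differ.

```java
// Hypothetical mapping from an observed node context to an action category (step 4).
// Reuses the Context, State and Action sketches above; not the actual PR code.
class NodeCategorizerSketch {
    static Action categorize(Context c, boolean responsive) {
        if (c.state == State.NOT_RUNNING || !responsive) {
            return Action.RESTART_UNHEALTHY;          // handled carefully in step 6
        } else if (c.state == State.RECOVERING) {
            return Action.WAIT_FOR_LOG_RECOVERY;      // waited on in step 5
        } else if (c.state == State.NOT_READY && c.numRestarts > 0) {
            return Action.WAIT_FOR_READINESS;         // already restarted, give it time to become READY
        } else if (c.state == State.READY && c.numRestarts > 0) {
            return Action.NOP;                        // restarted and back to READY, nothing more to do
        } else if (c.state == State.READY && c.reason.isEmpty()) {
            return Action.MAYBE_RECONFIGURE;          // refined into RECONFIGURE / RESTART / NOP in step 7
        } else if (!c.reason.isEmpty()) {
            return Action.RESTART;                    // has restart reasons and no previous restarts
        } else {
            return Action.NOP;
        }
    }
}
```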

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards
