From 64817da9bd050b87cc82dce90c9601f1eb0132b5 Mon Sep 17 00:00:00 2001
From: prmellor
Date: Wed, 11 Dec 2024 17:30:59 +0000
Subject: [PATCH 1/2] docs(mm2): considerations for active/passive disaster recovery

Signed-off-by: prmellor
---
 .../configuring/assembly-config.adoc          |  2 +
 .../modules/configuring/con-mm2-recovery.adoc | 44 +++++++++++++++++++
 2 files changed, 46 insertions(+)
 create mode 100644 documentation/modules/configuring/con-mm2-recovery.adoc

diff --git a/documentation/assemblies/configuring/assembly-config.adoc b/documentation/assemblies/configuring/assembly-config.adoc
index 5496095304e..fd1ed34018f 100644
--- a/documentation/assemblies/configuring/assembly-config.adoc
+++ b/documentation/assemblies/configuring/assembly-config.adoc
@@ -165,6 +165,8 @@ include::../../modules/configuring/proc-manual-stop-pause-mirrormaker2-connector
 include::../../modules/configuring/proc-manual-restart-mirrormaker2-connector.adoc[leveloffset=+2]
 //Procedure to restart an MM2 connector task
 include::../../modules/configuring/proc-manual-restart-mirrormaker2-connector-task.adoc[leveloffset=+2]
+//Disaster recovery
+include::../../modules/configuring/con-mm2-recovery.adoc[leveloffset=+2]
 
 //`KafkaMirrorMaker` resource config
 include::../../modules/configuring/con-config-mirrormaker.adoc[leveloffset=+1]
diff --git a/documentation/modules/configuring/con-mm2-recovery.adoc b/documentation/modules/configuring/con-mm2-recovery.adoc
new file mode 100644
index 00000000000..4ed6f8a9319
--- /dev/null
+++ b/documentation/modules/configuring/con-mm2-recovery.adoc
@@ -0,0 +1,44 @@
+// This module is included in:
+//
+// assembly-config.adoc
+
+[id="con-mm2-recovery-{context}"]
+= Disaster recovery in an active/passive configuration
+
+[role="_abstract"]
+MirrorMaker 2 can be configured for active/passive disaster recovery.
+To support this, the Kafka cluster should also be monitored for health and performance to detect issues that detect issues that require failover promptly.
+
+In the event of failover, operations switch from the active cluster to the passive cluster when the active cluster becomes unavailable.
+
+In a worst-case scenario, the original active cluster is typically considered permanently lost.
+The passive cluster is promoted to active status, taking over as the source for all application traffic.
+In this state, MirrorMaker 2 stops replicating data to the original active cluster because it is no longer functional.
+
+Failback, or restoring operations to the original active cluster, requires careful planning.
+While it is technically possible to reverse roles by swapping the source and target clusters in the MirrorMaker 2 configuration, this approach carries the risk of data duplication when records mirrored to the passive cluster are mirrored back to the original active cluster. To avoid duplicates, resetting consumer offsets is an option, but it adds further complexity.
+Generally, rebuilding the original active cluster and mirroring data from the disaster recovery cluster is a simpler and more reliable approach.
+
+Follow these best practices for disaster recovery in the event of failure of the active cluster in an active/passive configuration:
+
+. Promote the passive recovery cluster to an active role. +
+This minimizes downtime and ensures operations can continue.
+. Redirect applications to the new active recovery cluster. +
+Be aware that switching consumers to the recovery cluster may result in some message duplication.
+. Recreate the failed cluster in a clean state, adhering to the original configuration.
+. Deploy a new MirrorMaker 2 instance to replicate data from the active recovery cluster to the rebuilt cluster. +
+Treat the rebuilt cluster as the passive cluster during this replication process.
+To prevent automatic renaming of topics, configure MirrorMaker 2 to use the `IdentityReplicationPolicy` by setting the `replication.policy.class` property in the MirrorMaker 2 configuration.
+With this configuration applied, topics retain their original names in the target cluster.
+. Ensure the rebuilt cluster mirrors all data from the now-active recovery cluster.
+. (Optional) Promote the rebuilt cluster back to active status by redirecting applications to the rebuilt cluster.
+
+NOTE: Before implementing any failover or failback processes, test your recovery approach in a controlled environment to minimize downtime and maintain data integrity.
+
+
+
+
+
+
+
+

From 4f20fa8c6e6f61429341dc98ef618357c2819d3c Mon Sep 17 00:00:00 2001
From: prmellor
Date: Thu, 12 Dec 2024 10:54:01 +0000
Subject: [PATCH 2/2] docs(review): edits to doc from comments by JS -- 01

Signed-off-by: prmellor
---
 .../modules/configuring/con-mm2-recovery.adoc | 38 +++++++++----------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/documentation/modules/configuring/con-mm2-recovery.adoc b/documentation/modules/configuring/con-mm2-recovery.adoc
index 4ed6f8a9319..189aaba0f3d 100644
--- a/documentation/modules/configuring/con-mm2-recovery.adoc
+++ b/documentation/modules/configuring/con-mm2-recovery.adoc
@@ -7,38 +7,38 @@
 
 [role="_abstract"]
 MirrorMaker 2 can be configured for active/passive disaster recovery.
-To support this, the Kafka cluster should also be monitored for health and performance to detect issues that detect issues that require failover promptly.
+To support this, the Kafka cluster should also be monitored for health and performance to detect issues that require failover promptly.
 
-In the event of failover, operations switch from the active cluster to the passive cluster when the active cluster becomes unavailable.
-
-In a worst-case scenario, the original active cluster is typically considered permanently lost.
+If failover occurs, which can be automated, operations switch from the active cluster to the passive cluster when the active cluster becomes unavailable.
+The original active cluster is typically considered permanently lost.
 The passive cluster is promoted to active status, taking over as the source for all application traffic.
-In this state, MirrorMaker 2 stops replicating data to the original active cluster because it is no longer functional.
+In this state, MirrorMaker 2 no longer replicates data from the original active cluster while it remains unavailable.
 
 Failback, or restoring operations to the original active cluster, requires careful planning.
-While it is technically possible to reverse roles by swapping the source and target clusters in the MirrorMaker 2 configuration, this approach carries the risk of data duplication when records mirrored to the passive cluster are mirrored back to the original active cluster. To avoid duplicates, resetting consumer offsets is an option, but it adds further complexity.
-Generally, rebuilding the original active cluster and mirroring data from the disaster recovery cluster is a simpler and more reliable approach.
+
+It is technically possible to reverse roles in MirrorMaker 2 by swapping the source and target clusters and deploying this configuration as a new instance.
+However, this approach risks data duplication, as records mirrored to the passive cluster may be mirrored back to the original active cluster.
+Avoiding duplicates requires resetting consumer offsets, which adds complexity.
+For a simpler and more reliable failback process, rebuild the original active cluster in a clean state and mirror data from the disaster recovery cluster.
 
 Follow these best practices for disaster recovery in the event of failure of the active cluster in an active/passive configuration:
 
 . Promote the passive recovery cluster to an active role. +
+Designate the passive cluster as the active cluster for all client connections.
 This minimizes downtime and ensures operations can continue.
 . Redirect applications to the new active recovery cluster. +
-Be aware that switching consumers to the recovery cluster may result in some message duplication.
-. Recreate the failed cluster in a clean state, adhering to the original configuration.
+MirrorMaker 2 synchronizes committed offsets to passive clusters, allowing consumer applications to resume from the last transferred offset when switching to the recovery cluster.
+However, because of the time lag in offset synchronization, switching consumers may result in some message duplication.
+To minimize duplication, switch all members of a consumer group together as soon as possible.
+Keeping the group intact minimizes the chance of a consumer processing duplicate messages.
+. Remove the MirrorMaker 2 configuration for replication from the original active cluster to the passive cluster. +
+After failover, the original configuration is no longer needed and should be removed to avoid conflicts.
+. Re-create the failed cluster in a clean state, adhering to the original configuration.
 . Deploy a new MirrorMaker 2 instance to replicate data from the active recovery cluster to the rebuilt cluster. +
 Treat the rebuilt cluster as the passive cluster during this replication process.
 To prevent automatic renaming of topics, configure MirrorMaker 2 to use the `IdentityReplicationPolicy` by setting the `replication.policy.class` property in the MirrorMaker 2 configuration.
 With this configuration applied, topics retain their original names in the target cluster.
 . Ensure the rebuilt cluster mirrors all data from the now-active recovery cluster.
-. (Optional) Promote the rebuilt cluster back to active status by redirecting applications to the rebuilt cluster.
-
-NOTE: Before implementing any failover or failback processes, test your recovery approach in a controlled environment to minimize downtime and maintain data integrity.
-
-
-
-
-
-
-
+. (Optional) Promote the rebuilt cluster back to active status by redirecting applications to the rebuilt cluster, after ensuring it is fully synchronized with the active cluster.
 
+NOTE: Before implementing any failover or failback processes, test your recovery approach in a controlled environment to minimize downtime and maintain data integrity.
\ No newline at end of file
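
The failback step that deploys a new MirrorMaker 2 instance with `IdentityReplicationPolicy` could look roughly like the following. This is a minimal sketch, not part of the patch: it assumes the `kafka.strimzi.io/v1beta2` `KafkaMirrorMaker2` schema in use when the patch was written, and the resource name, cluster aliases, and bootstrap addresses are hypothetical placeholders. The `sync.group.offsets.enabled` setting is included on the assumption that committed offsets should also be synchronized to the rebuilt cluster.

```yaml
# Sketch only: the name, aliases, and addresses below are hypothetical placeholders.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: failback-mirror-maker-2
spec:
  replicas: 1
  # MirrorMaker 2 runs against the target (rebuilt) cluster
  connectCluster: "rebuilt-cluster"
  clusters:
    # Now-active recovery cluster, used as the source
    - alias: "recovery-cluster"
      bootstrapServers: recovery-cluster-kafka-bootstrap:9092
    # Rebuilt cluster, treated as the passive target during replication
    - alias: "rebuilt-cluster"
      bootstrapServers: rebuilt-cluster-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: "recovery-cluster"
      targetCluster: "rebuilt-cluster"
      sourceConnector:
        config:
          # Keep original topic names instead of prefixing them with the source alias
          replication.policy.class: "org.apache.kafka.connect.mirror.IdentityReplicationPolicy"
      checkpointConnector:
        config:
          replication.policy.class: "org.apache.kafka.connect.mirror.IdentityReplicationPolicy"
          # Assumption: consumer group offsets should follow the data to the rebuilt cluster
          sync.group.offsets.enabled: "true"
      topicsPattern: ".*"
      groupsPattern: ".*"
```

Setting the same replication policy on both connectors keeps topic names and checkpointed offsets consistent with each other. Authentication, TLS, and resource limits are omitted and would need to match the actual deployment.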