EDU-3861: HAN prototyping
fairlydurable committed Jan 31, 2025
1 parent 58fdaf7 commit ee33f8b
Showing 8 changed files with 822 additions and 0 deletions.
84 changes: 84 additions & 0 deletions docs/production-deployment/cloud/high-availability/enable.mdx
@@ -0,0 +1,84 @@
---
id: enable
title: Enable high availability
sidebar_label: Enable high availability
slug: /cloud/high-availability/choosing-high-availability
description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
tags:
- Temporal Cloud
- Production
- High availability
keywords:
- availability
- explanation
- failover
- high-availability
- multi-region
- multi-region namespace
- namespaces
- temporal-cloud
- term
---
import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';

:::tip Support, stability, and dependency info

High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.

:::

You can enable the high-availability Namespace feature for your existing Namespace by [adding a second zone](#add-zones) to your Namespace.
After adding the second zone, Temporal Cloud begins data replication for your new standby replica.
Temporal Cloud notifies you once the replication has caught up and both Namespace zones are in sync.

**Advantages of using a high-availability Namespace:**

- No manual deployment or configuration needed, just simple push-button operation.
- Open Workflows continue in the standby region with minimal interruption and data loss.
- No changes needed for Worker and Workflow code during setup or failover.
- 99.99% contractual SLA.

## Upgrade an existing single-zone Namespace for high-availability functionality {#add-zones}

You can upgrade an existing single-zone Namespace to high availability by adding a standby zone.
The following sections show you how.

<div style={{backgroundColor: '#ffff00',padding: '0px 15px',borderRadius: '5px',border: '1px solid #cccccc',display: 'inline-block'}}>**The following material has not been audited for MRN/HAN**</div>

#### Temporal Cloud Web UI

To upgrade an existing Namespace to a multi-region Namespace:

1. Visit Temporal Cloud [Namespaces](https://cloud.temporal.io/namespaces) in your web browser.
1. Navigate to the Namespace details page.
1. Select the “Add a region” button.
1. Select the standby region you want to add to this Namespace.

You will see an estimated time for replication.
This time is based on your selection and the size and scale of Workflows in your Namespace.
An email alert is sent once your multi-region Namespace is ready for use.

#### Temporal 'tcld' CLI

At the command line, enter:

```
tcld namespace add-region \
--namespace <namespace_id>.<account_id> \
--region <region>
```

Specify the region code for the new region to add.
Before pressing return, add your authentication credentials. For example, `--ca-certificate-file <path-to-pem-file>`.
An email alert is sent once your multi-region Namespace is ready for use.

### Discontinuing multi-region availability {#discontinuing}

Disabling multi-region removes the high availability and automatic failover features that provide Temporal's highest service level agreement.
To disable the feature and end charges, users must contact [Temporal Support](https://support.temporal.io) directly.
MRN-specific charges for replication will stop once this decommissioning procedure completes.

- When making your request, you must let us know which region you want the Namespace to land in after removing the standby region.
- If you cease services in the middle of the month, your Namespace will be converted to a single-region Namespace within 1 business day.
- Temporal won't retain replicated data in the standby region once multi-region has been disabled.
- After disabling multi-region, Temporal Cloud cannot re-enable the feature for a given Namespace for seven days.
29 changes: 29 additions & 0 deletions docs/production-deployment/cloud/high-availability/faq.mdx
@@ -0,0 +1,29 @@
---
id: faq
title: Frequently Asked Questions
sidebar_label: Frequently Asked Questions
slug: /cloud/high-availability/faq
description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
tags:
- Temporal Cloud
- Production
- High availability
keywords:
- availability
- explanation
- failover
- high-availability
- multi-region
- multi-region namespace
- namespaces
- temporal-cloud
- term
---
import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';

:::tip Support, stability, and dependency info

High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.

:::

117 changes: 117 additions & 0 deletions docs/production-deployment/cloud/high-availability/how-it-works.mdx
@@ -0,0 +1,117 @@
---
id: how-it-works
title: How it works
sidebar_label: How it works
slug: /cloud/high-availability/how-it-works
description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
tags:
- Temporal Cloud
- Production
- High availability
keywords:
- availability
- explanation
- failover
- high-availability
- multi-region
- multi-region namespace
- namespaces
- temporal-cloud
- term
---
import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';

:::tip Support, stability, and dependency info

High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.

:::

In traditional active/active replication, multiple nodes serve requests and accept writes simultaneously, ensuring strong synchronous data consistency.
In contrast, with a Temporal Cloud high-availability Namespace, only the active zone accepts requests and writes at any given time.
Workflow history events are written to the active zone first and then asynchronously replicated to the standby zone replica, ensuring that the replica remains in sync.

<div style={{backgroundColor: '#ffff00',padding: '0px 15px',borderRadius: '5px',border: '1px solid #cccccc',display: 'inline-block'}}>**Needs new images**</div>

| Before failover | After failover |
| :-------------------------------------------------------: | :-----------------------------------------------------: |
| ![Before failover](/img/multi-region/before-failover.png) | ![After failover](/img/multi-region/after-failover.png) |

## Failovers {#failovers}

A failover shifts Workflow Execution processing from an active Temporal Namespace region to a standby Temporal Namespace region during outages or other incidents.
Standby Namespace regions use replication to duplicate data and prevent data loss during failover.

**What happens during the failover process?**

Temporal Cloud initiates a Namespace failover when it detects an incident or outage that raises error rates or latency in the active region of a multi-region Namespace.
The failover shifts Workflow processing to a standby region that isn’t affected by the incident.
This lets existing Workflows continue and new Workflows start while the incident is fixed.
Once the incident is resolved, Temporal Cloud performs a "failback" by shifting Workflow Execution processing back to the original region.


:::info

You can test the failover of your multi-region Namespace by manually [triggering a failover](/cloud/multi-region#triggering-failovers) using the UI page or the 'tcld' CLI utility.
In most scenarios, we recommend you let Temporal handle failovers for you.

:::

## Health Checks {#healthchecks}

**How does Temporal detect failover conditions?**

Temporal Cloud automates failovers by performing internal health checks.
This process monitors your request error rates, latencies, and any infrastructure issues that might cause service disruptions, such as request timeouts.
It automatically triggers failovers when these indicators exceed our allowed thresholds.
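
As a rough illustration of threshold-based detection, the following Python sketch evaluates a series of health samples for the active zone. The `HealthSample` type, the `should_fail_over` function, and every threshold value are hypothetical and chosen only for this example. Temporal Cloud's actual health checks and thresholds are internal and are not described in this document.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One observation window for the active zone (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # observed request latency in milliseconds
    timed_out: bool        # whether requests timed out in this window


# Illustrative thresholds only; these are assumptions, not published values.
MAX_ERROR_RATE = 0.05
MAX_LATENCY_MS = 2000.0
MIN_BAD_WINDOWS = 3


def should_fail_over(samples: list[HealthSample]) -> bool:
    """Return True when each of the most recent windows breaches a threshold."""
    recent = samples[-MIN_BAD_WINDOWS:]
    if len(recent) < MIN_BAD_WINDOWS:
        return False  # not enough data to decide
    return all(
        s.error_rate > MAX_ERROR_RATE
        or s.p99_latency_ms > MAX_LATENCY_MS
        or s.timed_out
        for s in recent
    )
```

Requiring several consecutive unhealthy windows is one way such a check could avoid failing over on a single transient spike.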

### Replication lag {#replication-lag}

Multi-region Namespaces use asynchronous replication between regions.
Workflow updates in the active region, along with associated history events, are transmitted to the standby region with a short delay.
This delay is called the replication lag.
Temporal Cloud strives to maintain a P95 replication delay of less than 1 minute.
In this context, P95 means 95% of requests are processed faster than this specified limit.
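
To make the P95 figure concrete, the sketch below computes a nearest-rank 95th percentile over a set of replication lag measurements and compares it with the one-minute target. The `percentile` helper and the sample values are invented for illustration; how Temporal Cloud measures replication lag internally is not specified here.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical replication lag samples, in seconds.
lag_seconds = [2, 3, 3, 4, 5, 5, 6, 8, 9, 12, 14, 15, 18, 20, 22, 25, 30, 41, 55, 90]

p95 = percentile(lag_seconds, 95)
meets_target = p95 < 60  # target: P95 lag under 1 minute
```

In this sample set, one measurement exceeds a minute, yet the P95 stays under the target because the percentile tolerates the slowest 5% of samples.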

Replication lag means that a [forced failover](/cloud/multi-region#forced-failover) may cause Workflows to roll back some progress.
Lag may also cause recently started Workflows to be temporarily unavailable until the active region recovers.
Temporal event versioning and [conflict resolution mechanisms](/cloud/multi-region#conflict-resolution) help guarantee that the Workflow Event History can be replayed.
Critical operations like Signals won't get lost.

### Failover scenarios

The Temporal Cloud failover mechanism supports several modes to execute Namespace failovers.
These modes include graceful failover ("handover"), forced failover, and a hybrid mode.
The hybrid mode is Temporal Cloud’s default Namespace behavior.

#### Graceful failover (handover) {#graceful-failover}

In this mode, replication tasks are fully processed and drained.
Temporal Cloud pauses traffic to the Namespace before the failover.
This prevents the loss of progress and avoids data conflicts.
The Namespace experiences a short period of unavailability, defaulting to 10 seconds.

During this period, existing Workflows stop making progress.
Temporal Cloud returns a "Service unavailable" error, which SDKs retry.
State transitions don't happen and tasks aren't dispatched.
User requests, such as starting or signaling a Workflow, are rejected while operations are paused during handover.

This mode favors _consistency_ over availability.

#### Forced failover {#forced-failover}

In this mode, a Namespace immediately activates in the standby region.
Events not replicated due to [replication lag](/cloud/multi-region#replication-lag) will undergo [conflict resolution](/cloud/multi-region#conflict-resolution) upon reaching the new active region.

This mode prioritizes _availability_ over consistency.

#### Hybrid failover mode {#hybrid-failover}

While graceful failovers are preferred for consistency, they aren’t always practical.
Temporal Cloud’s hybrid failover mode (the default mode) limits an initial graceful failover attempt to 10 seconds or less.
During this period, existing Workflows stop making progress.
Temporal Cloud returns a "Service unavailable" error, which SDKs retry.
If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover.
This strategy balances consistency and availability requirements.
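
The hybrid strategy can be sketched as a bounded wait followed by a fallback. The Python below is a minimal model under stated assumptions: `hybrid_failover` and the `replication_drained` callback are hypothetical names, and the polling loop stands in for whatever mechanism Temporal Cloud actually uses. Only the 10-second limit comes from the description above.

```python
import time
from typing import Callable

GRACEFUL_TIMEOUT_SECONDS = 10.0  # limit for the graceful attempt, per the text above


def hybrid_failover(
    replication_drained: Callable[[], bool],
    timeout_seconds: float = GRACEFUL_TIMEOUT_SECONDS,
    poll_interval_seconds: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try a graceful handover first; fall back to a forced failover on timeout.

    Returns "graceful" or "forced" to indicate which path was taken.
    """
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if replication_drained():
            return "graceful"  # replication caught up: no progress is lost
        sleep(poll_interval_seconds)
    return "forced"  # deadline passed: favor availability over consistency
```

The `clock` and `sleep` parameters exist only so the timeout path can be exercised without waiting ten real seconds.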

See the sections on [triggering a failover](/cloud/multi-region#triggering-failovers), [Worker deployment](/cloud/multi-region#worker-deployment), and [routing](/cloud/multi-region#routing) for more information.
87 changes: 87 additions & 0 deletions docs/production-deployment/cloud/high-availability/index.mdx
@@ -0,0 +1,87 @@
---
id: index
title: High-availability Namespaces
sidebar_label: High-availability Namespaces
slug: /cloud/high-availability
description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
tags:
- Temporal Cloud
- Production
- High availability
keywords:
- availability
- explanation
- failover
- high-availability
- multi-region
- multi-region namespace
- namespaces
- temporal-cloud
- term
---

import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';

:::tip Support, stability, and dependency info

High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.

:::

Temporal Cloud's high-availability Namespaces provide disaster-tolerant deployment for workloads where availability is critical to your operations.
When you enable high availability, Temporal Cloud automatically synchronizes your data between a primary and a fallback Namespace, keeping them in sync.
Should an incident occur, Temporal will [fail over](/glossary#failover) your Namespace.
This allows your Workflow Executions and Schedules to seamlessly shift from the active availability zone to the fallback availability zone.

## Availability zones and replicas

An availability zone is a physically isolated data center within a deployment region for a given cloud provider.
Regions consist of multiple availability zones, providing redundancy and fault tolerance.
In some cases, the fallback zone may be in the same region as the primary zone, or it may be in a different region altogether, depending on your deployment configuration.

High availability simplifies deployment, ensuring operational continuity and data integrity even during unexpected events.
Regional disruptions or other issues that affect the data centers within a specific availability zone may occur.
High availability allows processing to shift from the affected zone to an already-synchronized fallback zone.

This synchronized zone is called a "replica."
The process of duplicating all Workflow data ensures that your replica, which serves as the standby region, is always available and ready to take on the active role.

In the event of network service or performance issues in the active zone, your replica is ready to take over.
When necessary, Temporal Cloud smoothly transitions control from the active to the standby zone using a process called "[failover](/glossary#failover)".

## Why choose high availability? {#high-availability-intro}

For many organizations, ensuring high availability is critical to maintaining business continuity.
Temporal Cloud's high-availability Namespace feature includes a 99.99% contractual Service Level Agreement ([SLA](https://docs.temporal.io/cloud/sla)).
It provides 99.99% availability and a 99.99% guarantee against service errors.

A high-availability Namespace (HAN) creates a single logical Namespace that operates across two physical zones: one active and one standby.
HANs streamline access for both zones to a unified Namespace endpoint.
As Workflows progress in the active zone, history events are asynchronously replicated to the standby zone, ensuring continuity and data integrity.

In the event of an incident or outage in the active zone, Temporal Cloud will seamlessly fail over to your standby zone.
Failovers allow existing Workflow Executions to continue running and new Workflow Executions to be started.
Once failover occurs, the roles of the active and standby zones switch.
The standby zone becomes active, and the previous active zone becomes the standby.
After the issue is resolved, the Namespace "fails back" from the replica to the original zone.
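
The role swap described above can be modeled in a few lines. This toy Python class is only an illustration: the `HighAvailabilityNamespace` name and the zone labels are made up, and the model ignores replication entirely.

```python
from dataclasses import dataclass


@dataclass
class HighAvailabilityNamespace:
    """Toy model of one logical Namespace with two zones (names are illustrative)."""
    active_zone: str
    standby_zone: str

    def fail_over(self) -> None:
        """Swap roles: the standby becomes active and the old active becomes standby."""
        self.active_zone, self.standby_zone = self.standby_zone, self.active_zone


ns = HighAvailabilityNamespace(active_zone="zone-a", standby_zone="zone-b")
ns.fail_over()  # incident: zone-b now serves traffic, zone-a is the standby
ns.fail_over()  # failback: zone-a is active again
```

Because a failback is the same role swap applied a second time, the Namespace ends up in its original configuration.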

## Opting into high availability

Should you be using high-availability Namespaces? It depends on your availability requirements:

- High-availability Namespaces offer a 99.99% contractual SLA for workloads with strict high-availability needs.
HANs replicate a Namespace across two deployment zones to support standby recovery.
In the event of a zone failure, Temporal Cloud automatically fails over the HAN to the standby replica.
- Single-zone Namespaces include a 99.9% contractual Service Level Agreement ([SLA](/cloud/sla)).
In single-zone use, Temporal clients connect to a single Namespace in one deployment zone.
For many applications, this offers sufficient availability.
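
To compare the two SLA levels in practical terms, the following Python sketch converts an SLA percentage into a downtime budget. It assumes a 30-day month and a simple uptime-percentage reading of the SLA. Both are simplifications, and the `allowed_downtime_minutes` helper is hypothetical; the SLA document defines how availability is actually measured.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, that still meets the SLA over the period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


single_zone = allowed_downtime_minutes(99.9)         # about 43.2 minutes per 30 days
high_availability = allowed_downtime_minutes(99.99)  # about 4.32 minutes per 30 days
```

Under these assumptions, moving from 99.9% to 99.99% shrinks the monthly downtime budget by a factor of ten.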

Temporal Cloud provides 99.99% service availability for all Namespaces, both single-region and high-availability.

| **Advantages of using a multi-region Namespace** |
| ------------------------------------------------------------------------------------ |
| No manual deployment or configuration required—just simple push-button operation. |
| Open Workflows continue in the standby zone with minimal interruption and data loss. |
| No changes needed for Worker or Workflow code during setup or failover. |
| 99.99% contractual SLA. |

30 changes: 30 additions & 0 deletions docs/production-deployment/cloud/high-availability/operations.mdx
@@ -0,0 +1,30 @@
---
id: operations
title: Operations
sidebar_label: Operations
slug: /cloud/high-availability/operations
description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
tags:
- Temporal Cloud
- Production
- High availability
keywords:
- availability
- explanation
- failover
- high-availability
- multi-region
- multi-region namespace
- namespaces
- temporal-cloud
- term
---
import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';

:::tip Support, stability, and dependency info

High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.

:::


29 changes: 29 additions & 0 deletions docs/production-deployment/cloud/high-availability/pricing.mdx
@@ -0,0 +1,29 @@
---
id: pricing
title: Pricing (and Support?)
sidebar_label: Pricing (and Support?)
slug: /cloud/high-availability/pricing
description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
tags:
- Temporal Cloud
- Production
- High availability
keywords:
- availability
- explanation
- failover
- high-availability
- multi-region
- multi-region namespace
- namespaces
- temporal-cloud
- term
---
import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';

:::tip Support, stability, and dependency info

High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.

:::
