8 changes: 4 additions & 4 deletions docs/cloud/high-availability/failovers.mdx
@@ -150,7 +150,7 @@ At any time only the primary or the replica is active.
The only exception occurs in the event of a [network partition](https://en.wikipedia.org/wiki/Network_partition), when a Network splits into separate subnetworks.
Should this occur, you can promote a replica to active status.
**Caution:** This temporarily makes both regions active.
-After the network partition is resolved and communication between the isolation domains/regions is restored, a conflict resolution algorithm determines whether the primary or replica remains active.
+After the network partition is resolved and communication between the regions is restored, a conflict resolution algorithm determines whether the primary or replica remains active.

:::tip

@@ -289,7 +289,7 @@ See [Returning to the primary with failbacks](#failbacks) for details on how and

After any failover, whether triggered by you or by Temporal, an event appears in both the [Temporal Cloud Web UI](https://cloud.temporal.io/namespaces) (on the Namespace detail page) and in your audit logs.
The audit log entry for Failover uses the `"operation": "FailoverNamespace"` event.
-After failover, the replica becomes active, taking over in the isolation domain or region.
+After failover, the replica becomes active, taking over from the original region.

You don't need to monitor Temporal Cloud's failover response in real time.
Whenever there is a failover event, Temporal Cloud [notifies you via email](/cloud/notifications#admin-notifications)
@@ -405,14 +405,14 @@ Failover testing (also known as "<ToolTipTerm term="trigger testing" />)" can:

- **Validate replicated deployments**:
In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences outages.
-In standard setups, failover testing instead works with an isolation domain.
+In Same-region Replication setups, failover testing instead works with a separate cell within the same region.
This maintains high availability in mission-critical deployments.
Manual testing confirms the failover mechanism works as expected, so your system handles incidents effectively.

- **Assess replication lag**:
In multi-region deployment, monitoring [replication lag](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p99) between regions is crucial.
Check the lag before initiating a failover to avoid rolling back Workflow progress.
-This is less important when using isolation domains as failover is usually instantaneous.
+This is less important with Same-region Replication, as failover is usually instantaneous.
Manual testing helps you practice this critical step and understand its impact.

- **Assess recovery time**:
18 changes: 11 additions & 7 deletions docs/cloud/high-availability/index.mdx
@@ -20,8 +20,12 @@ keywords:

import { ToolTipTerm, DiscoverableDisclosure, CaptionedImage } from '@site/src/components';

-Temporal Cloud's High Availability features use asynchronous <ToolTipTerm term="replication"/> across multiple <ToolTipTerm term="isolation domains" /> to provide enhanced resilience and a 99.99% [SLA](/cloud/sla).
-When you enable High Availability features, Temporal deploys your primary and its <ToolTipTerm term="replica"/> in separate isolation domains, giving you control over the location of both. This redundancy, combined with <ToolTipTerm term="failover"/> capability, enhances availability during outages.
+Temporal keeps your Workflows running even when a Worker crashes. But what happens when a whole data center crashes? Or a region?
+
+In the cloud, outages are commonplace. An outage can bring down a whole data center, cluster, region, or cloud provider. To be durable in the cloud, Workflows and applications must handle these outages smoothly, just like Temporal handles a Worker crash.
+
+Temporal Cloud's High Availability features add extra reliability to Temporal Cloud Namespaces by handling cloud outages. Using asynchronous <ToolTipTerm term="replication"/> between multiple regions or cloud providers, combined with automatic outage detection and failover, High Availability keeps your Workflows running even during a cloud region outage.
+This extra availability comes with an enhanced [SLA](/cloud/sla) of 99.99%, _including_ cloud provider outages.

:::tip White paper

@@ -34,7 +38,7 @@ For an in-depth guide covering everything from why you need High Availability to
Even without High Availability features, Temporal Cloud provides robust reliability and a 99.9% contractual Service Level Agreement ([SLA](/cloud/sla)) guarantee against service errors.

Each standard Temporal Namespace uses replication across three [Availability Zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) (AZs) to ensure high availability.
-An Availability Zone is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure.
+An Availability Zone is akin to an isolated data center managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure.

Replication across AZs makes sure that any changes to Workflow state or History are saved in all three AZs _before_ the Temporal Service acknowledges a change back to the Client.
As a result, your standard Temporal Namespace stays operational even if one of its three AZs becomes unavailable.
@@ -44,7 +48,7 @@ However some critical use cases--such as customer-facing applications--require e

## High Availability features {#high-availability-features}

-High Availability features extend Temporal Cloud's replication offering across even more disparate isolation domains:
+High Availability features extend Temporal Cloud's replication across regions and cloud providers, so your Namespace keeps running even when a whole region or cloud provider goes down:

| **Deployment** | **Description** |
| --------------------------------------- | ---------------------------------------------------------- |
@@ -53,7 +57,7 @@ High Availability features extend Temporal Cloud's replication offering across e

### Key features

-- **Real-time replication** — Temporal replicates your Namespace across distant isolation domains with no performance impact to your Workers or Workflows.
+- **Real-time replication** — Temporal replicates your Namespace across distant regions or cloud providers with no performance impact to your Workers or Workflows.
- **Automatic failover with 20-minute RTO** — Temporal manages failover with a 20-minute [RTO](/cloud/rpo-rto). You can also [trigger failover](/cloud/high-availability/failovers) manually at any time, for example for testing.
- **Transparent DNS routing** — On failover, DNS reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) to the active region. Requests that reach the replica are forwarded to the active region automatically.
- **Sub-1-minute RPO** — In a failover during an outage, the [Recovery Point Objective](/cloud/rpo-rto) is under one minute.
@@ -63,10 +67,10 @@ High Availability features extend Temporal Cloud's replication offering across e
:::info Region availability

You can usually choose your replica region, but the replica must be on the same continent as the primary region.
-This means that a few Temporal Cloud regions do not yet support Multi-region Replication and/or Multi-cloud Replication.
+This means that a few Temporal Cloud regions do not yet support Multi-region Replication or Multi-cloud Replication.
See [Regions](/cloud/regions) for a full list of supported replica regions.

-You can't enable both Multi-region Replication and Multi-cloud Replciation on the same Namepsace at the same time.
+You can't enable both Multi-region Replication and Multi-cloud Replication on the same Namespace at the same time.

:::
