6 changes: 6 additions & 0 deletions docs/cloud/metrics/index.mdx
@@ -29,6 +29,12 @@ Each source provides different levels of granularity, filtering options, monitor

When used together, Cloud and SDK metrics measure the health and performance of your full Temporal infrastructure, including the Temporal Cloud Service and user-supplied Temporal Workers.

:::tip New to Cloud metrics?

Start with the [OpenMetrics Quickstart](/cloud/metrics/openmetrics#quickstart) to create a Service Account, generate an API key, and stream metrics into Datadog, Grafana Cloud, New Relic, ClickStack, or self-hosted Prometheus in about 5 minutes.

:::

## Cloud Metrics

Cloud metrics for all Namespaces in your account are available from the [OpenMetrics endpoint](/cloud/metrics/openmetrics), a Prometheus-compatible scrapable endpoint at `metrics.temporal.io`.
49 changes: 47 additions & 2 deletions docs/cloud/metrics/openmetrics/index.mdx
@@ -31,11 +31,56 @@ Future pricing may apply to high-volume usage that exceeds standard [limits](/cl

Temporal Cloud's [OpenMetrics](https://openmetrics.io/) endpoint provides operational metrics for your Temporal Cloud workloads in industry-standard Prometheus format, enabling comprehensive monitoring across Namespaces, Workflows, and Task Queues with your existing observability stack.

## Quickstart

Stream metrics from Temporal Cloud into your observability tool in about 5 minutes.

**Prerequisites**

- An **Account Owner** or **Global Admin** role on the Temporal Cloud account. The Metrics Read-Only role is an account-level role and can only be granted by these roles. A Namespace Admin cannot complete these steps.
- An account in the observability tool you want to use (Datadog, Grafana Cloud, New Relic, ClickStack, self-hosted Prometheus, etc.).

**Steps**

1. **Create a Service Account with the Metrics Read-Only role.**

In the Temporal Cloud UI, go to **Settings → Service Accounts → Create Service Account** and assign the **Metrics Read-Only** account-level role.

2. **Generate an API key for the Service Account.**

Open the Service Account and create an API key. Copy the key and store it somewhere secure. It is shown only once.

3. **Verify the endpoint is reachable.**

```shell
curl -H "Authorization: Bearer <API_KEY>" https://metrics.temporal.io/v1/metrics
```

You should see OpenMetrics-formatted output beginning with `# TYPE temporal_cloud_v1_...`.

:::note `metrics.temporal.io` is for scrapers, not browsers

The endpoint requires an `Authorization: Bearer <API key>` header on every request. There is no browser UI. Opening `https://metrics.temporal.io` or `https://metrics.temporal.io/v1/metrics` directly in a browser returns `Jwt is missing`. Configure the endpoint inside your observability tool instead.

:::

4. **Configure your observability tool.**

Paste the API key into the integration for your tool of choice. See [Metrics integrations](/cloud/metrics/openmetrics/metrics-integrations) for tool-specific setup:

- [Datadog](/cloud/metrics/openmetrics/metrics-integrations#datadog)
- [Grafana Cloud](/cloud/metrics/openmetrics/metrics-integrations#grafana-cloud)
- [New Relic](/cloud/metrics/openmetrics/metrics-integrations#new-relic)
- [ClickStack](/cloud/metrics/openmetrics/metrics-integrations#clickstack)
- [Self-hosted Prometheus or OpenTelemetry Collector](/cloud/metrics/openmetrics/metrics-integrations#prometheus-grafana)

Metrics begin populating in your tool within a few minutes.
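
The `curl` check in step 3 can also be scripted. The following is a minimal sketch, assuming Python 3 with only the standard library; the function names are illustrative and not part of any Temporal tooling. It sends the same `Authorization: Bearer` header and looks for a `# TYPE temporal_cloud_v1_` line in the response.

```python
import urllib.request

# Endpoint URL taken from step 3 of the Quickstart.
METRICS_URL = "https://metrics.temporal.io/v1/metrics"


def build_request(api_key: str, url: str = METRICS_URL) -> urllib.request.Request:
    """Build a GET request carrying the Bearer header the endpoint requires."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def looks_like_openmetrics(body: str) -> bool:
    """Return True if any line is a '# TYPE temporal_cloud_v1_...' declaration."""
    return any(
        line.startswith("# TYPE temporal_cloud_v1_") for line in body.splitlines()
    )


def check_endpoint(api_key: str, timeout: float = 10.0) -> bool:
    """Fetch one scrape and report whether it looks like OpenMetrics output.

    urlopen raises urllib.error.HTTPError on a non-2xx status, for example
    when the key is missing or rejected.
    """
    with urllib.request.urlopen(build_request(api_key), timeout=timeout) as resp:
        return looks_like_openmetrics(resp.read().decode("utf-8"))
```

Calling `check_endpoint(api_key)` with the key from step 2 performs one real scrape. Read the key from an environment variable or secret store rather than hard-coding it.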

## Quick Links
* [Integrations](/cloud/metrics/openmetrics/metrics-integrations) - Get started exporting metrics with common integrations
* [API Documentation](/cloud/metrics/openmetrics/api-reference) - Endpoint specification and advanced configuration
* [Metrics Reference](/cloud/metrics/openmetrics/metrics-reference) - Complete catalog of all metrics with descriptions and labels
* [Migration Guide](/cloud/metrics/openmetrics/migration-guide) - Guide on how to transition from the Prometheus query endpoint
* [Migration Guide](/cloud/metrics/openmetrics/migration-guide) - Transition from the deprecated Prometheus query endpoint to OpenMetrics

## Overview
Temporal Cloud OpenMetrics exposes 50+ metrics covering workflow lifecycles, task queue operations, service performance, and system limits. All metrics are aggregated over one-minute windows and available for scraping within three minutes. Each scrape returns only the most recently completed one-minute window—configure your monitoring system to retain what it scrapes.
@@ -45,7 +90,7 @@ Temporal Cloud OpenMetrics exposes 50+ metrics covering workflow lifecycles, tas
* Teams using the query endpoint should review the [migration guide](/cloud/metrics/openmetrics/migration-guide).

## API key authentication
Create a [service account](/cloud/metrics/openmetrics/migration-guide#create-an-api-key) with the "Metrics Read-Only" role, generate an API key, and start scraping immediately - no certificate rotation or distribution required.
Create a Service Account with the "Metrics Read-Only" role and generate an API key. See the [Quickstart](#quickstart) above for step-by-step instructions. API keys work with standard HTTPS, with no certificate rotation or distribution required.

## Global endpoint
A single endpoint at `metrics.temporal.io` serves all metrics across your entire account over standard HTTPS, using API key authentication.
60 changes: 54 additions & 6 deletions docs/cloud/metrics/openmetrics/metrics-integrations.mdx
@@ -30,26 +30,74 @@ This document is for basic configuration only. For advanced concepts such as lab

## Integrations

Before configuring any integration, complete the [Quickstart](/cloud/metrics/openmetrics#quickstart) to create a Service Account with the **Metrics Read-Only** role and generate an API key. This requires the **Account Owner** or **Global Admin** role; a Namespace Admin cannot grant the Metrics Read-Only role.

### Datadog

Datadog provides a serverless integration with the OpenMetrics endpoint. This integration will scrape metrics, store them in Datadog, and provides a default dashboard with some built in monitors. See the [integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/) for more details.
Datadog provides a serverless integration with the OpenMetrics endpoint. It scrapes metrics, stores them in Datadog, and ships a default dashboard with built-in monitors.

1. In Datadog, open the [Integrations catalog](https://app.datadoghq.com/integrations) and search for **Temporal Cloud OpenMetrics**. Install the integration.
2. Click **Add Account** in the integration tile and paste your Temporal Cloud API key into the **API Key** field.
3. Save the configuration. The default Temporal Cloud dashboard appears in **Dashboards → Dashboards List** once data starts flowing (typically within a few minutes).

For Datadog-side details, see the [Datadog integration page](https://docs.datadoghq.com/integrations/temporal-cloud-openmetrics/).

### Grafana Cloud

Grafana provides a serverless integration with the OpenMetrics endpoint for Grafana Cloud. This integration will scrape metrics, store them in Grafana Cloud, and provides a default dashboard
for visualizing the metrics in Grafana Cloud. See the [integration page](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/integrations/integration-reference/integration-temporal/)
for more details.
Grafana Cloud provides a serverless integration with the OpenMetrics endpoint. It scrapes metrics, stores them in Grafana Cloud, and ships a default dashboard for visualizing them.

1. In Grafana Cloud, go to **Connections → Add new connection** and search for **Temporal Cloud**.
2. On the integration page, paste your Temporal Cloud API key into the **API Key** field.
3. Add `metrics.temporal.io` to **Allowed hosts** so Grafana Cloud can reach the endpoint.
4. Click **Install** to enable the integration and import the pre-built dashboard.

If the dashboard shows no data after a few minutes, confirm the API key's Service Account has the **Metrics Read-Only** role and that the endpoint is reachable using the `curl` check from the [Quickstart](/cloud/metrics/openmetrics#quickstart).

For Grafana-side details, see the [Grafana Cloud integration page](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/integrations/integration-reference/integration-temporal/).

### ClickStack

ClickHouse provides an integration with the OpenMetrics endpoint for ClickStack. This integration uses an OpenTelemetry collector to read from the OpenMetrics endpoint, ingest data into ClickHouse, and
includes a default dashboard to visualize the data with HyperDX. See the [integration page](https://clickhouse.com/docs/use-cases/observability/clickstack/integrations/temporal-metrics) for more details.

1. Save your Temporal Cloud API key to a local file named `temporal.key` (no trailing newline or spaces).
2. Create an OpenTelemetry collector config named `temporal-metrics.yaml` that uses a Prometheus receiver against `metrics.temporal.io` with Bearer token auth, a 60-second scrape interval, the `service.name: "temporal"` resource attribute, and the ClickHouse exporter. Copy the full template from the [ClickStack integration page](https://clickhouse.com/docs/use-cases/observability/clickstack/integrations/temporal-metrics).
3. Mount both files into your ClickStack collector and set the custom config env var. With Docker Compose:

```yaml
volumes:
  - ./temporal-metrics.yaml:/etc/otelcol-contrib/custom.config.yaml
  - ./temporal.key:/etc/otelcol-contrib/temporal.key
environment:
  CUSTOM_OTELCOL_CONFIG_FILE: /etc/otelcol-contrib/custom.config.yaml
```

4. In HyperDX, open the **Metrics explorer** and confirm metrics with the `temporal` prefix are arriving.
5. Import the pre-built dashboard: in HyperDX click **Import Dashboard**, upload `temporal-metrics-dashboard.json` from the ClickStack integration page, then click **Finish Import**.
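
Step 1 asks for a key file with no trailing newline or spaces, which is easy to get wrong when saving from an editor. The following is a minimal sketch, assuming Python 3; `write_key_file` is a hypothetical helper, not part of ClickStack or Temporal tooling. It strips surrounding whitespace before writing.

```python
from pathlib import Path


def write_key_file(api_key: str, path: str = "temporal.key") -> Path:
    """Write the API key with surrounding whitespace stripped and no newline."""
    cleaned = api_key.strip()
    if not cleaned:
        raise ValueError("API key is empty")
    target = Path(path)
    # write_text writes exactly the given string; it does not append a newline.
    target.write_text(cleaned, encoding="utf-8")
    return target
```

Because the file holds a credential, keep it out of version control and restrict who can read it on the host.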

### New Relic

New Relic integrates with Temporal Cloud via the infrastructure agent using a flex integration that pulls data from the OpenMetrics endpoint. See the [integration page](https://docs.newrelic.com/docs/infrastructure/host-integrations/host-integrations-list/temporal-cloud-integration/) for more details.
The New Relic integration pulls metrics from the OpenMetrics endpoint via the `nri-flex` integration that runs alongside the New Relic infrastructure agent.

:::note Requires a host

The integration runs on a host (Linux, Windows, or Kubernetes) with the New Relic infrastructure agent installed. The agent scrapes the endpoint and forwards metrics to New Relic.

:::

1. Install the **New Relic infrastructure agent** on a host. See the [agent install docs](https://docs.newrelic.com/docs/infrastructure/install-infrastructure-agent/get-started/install-infrastructure-agent/) for platform-specific instructions.
2. Create `/etc/newrelic-infra/integrations.d/nri-flex-temporal-cloud-config.yml` using the template from the [New Relic integration page](https://docs.newrelic.com/docs/infrastructure/host-integrations/host-integrations-list/temporal-cloud-integration/), and replace the `${TEMPORAL_API_KEY}` placeholder with your Temporal Cloud API key.
3. Restart the agent so the new config is picked up:

```shell
sudo systemctl restart newrelic-infra.service
```

4. In **one.newrelic.com**, go to **Integrations & Agents → Dashboards**, search for **Temporal Cloud**, and install the pre-built dashboard. Data appears within a few minutes.

For New Relic-side details, see the [New Relic integration page](https://docs.newrelic.com/docs/infrastructure/host-integrations/host-integrations-list/temporal-cloud-integration/).

### Prometheus \+ Grafana
### Prometheus \+ Grafana {#prometheus-grafana}

Self-hosted Prometheus can be used to scrape the OpenMetrics endpoint.

10 changes: 6 additions & 4 deletions docs/cloud/metrics/openmetrics/metrics-reference.mdx
@@ -94,7 +94,7 @@ gRPC requests received per second.

#### temporal\_cloud\_v1\_service\_request\_throttled\_count

gRPC requests throttled per second.
gRPC requests throttled per second. See [Monitoring Trends Against Limits](/cloud/service-health#rps-aps-rate-limits) for guidance on setting alert thresholds against the corresponding limit metric.

| Label | Description |
| ----- | ----- |
@@ -124,7 +124,9 @@ The number of pollers that are actively long polling for a task. Use this to tra

#### temporal\_cloud\_v1\_resource\_exhausted\_error\_count

Resource exhaustion errors per second. This metric does not include throttling due to Namespace limits.
Resource exhaustion errors per second, incremented when a single resource receives a burst larger than it can absorb. SDKs retry these errors gracefully. This metric does not include throttling due to Namespace limits. For rate limiting against account limits, see [`temporal_cloud_v1_total_action_throttled_count`](#temporal_cloud_v1_total_action_throttled_count) and the related throttle metrics.

See [Detecting Resource Exhaustion](/cloud/service-health#detecting-resource-exhaustion) for guidance on investigating non-zero values.

| Label | Description |
| ----- | ----- |
@@ -633,7 +635,7 @@ This metric could have high cardinality depending on number of action types and

#### temporal\_cloud\_v1\_total\_action\_throttled\_count

The total number of actions throttled per second.
The total number of actions throttled per second. See [Monitoring Trends Against Limits](/cloud/service-health#rps-aps-rate-limits) for guidance on setting alert thresholds against the corresponding limit metric.

**Type**: Rate

@@ -651,7 +653,7 @@ Operations performed per second.

#### temporal\_cloud\_v1\_operations\_throttled\_count

Operations throttled due to rate limits per second.
Operations throttled due to rate limits per second. See [Monitoring Trends Against Limits](/cloud/service-health#rps-aps-rate-limits) for guidance on setting alert thresholds against the corresponding limit metric.

| Label | Description |
| ----- | ----- |
9 changes: 6 additions & 3 deletions docs/cloud/service-health.mdx
@@ -146,13 +146,16 @@ See [operations and metrics](/cloud/high-availability) for Namespaces with High

## Detecting Resource Exhaustion

The Cloud metric `temporal_cloud_v1_resource_exhausted_error_count` is the primary indicator for Cloud-side throttling, signaling system limits
are exceeded and `ResourceExhausted` gRPC errors are occurring. This generally does not break workflow processing due to how resources are prioritized.
Resource exhaustion happens when a single resource (a Namespace, Task Queue, or Workflow ID) receives a burst of operations larger than that resource can absorb in the moment. The Cloud metric `temporal_cloud_v1_resource_exhausted_error_count` increments and `ResourceExhausted` gRPC errors are returned to the client. SDKs retry these errors gracefully, so workflow progress is rarely impacted.

Persistent non-zero values of this metric are unexpected.
Persistent non-zero values are unexpected and indicate a hot resource. Use the `operation` label to identify which RPC is hitting the burst limit. For example, `StartWorkflowExecution` increments here when the same Workflow ID is started more than once per second.

Resource exhaustion is distinct from rate limiting against your account limits. For workloads that are throttled because they exceed their provisioned capacity, see [Monitoring Trends Against Limits](#rps-aps-rate-limits). Limits-driven throttling slows or stalls a workload, so it is generally the more important signal to monitor.
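
To find the hot RPC, group the metric by its `operation` label. The following is a minimal sketch, assuming Python 3.9+ and a saved scrape in Prometheus text format; the parser handles only simple sample lines, and the function name is illustrative.

```python
import re
from collections import defaultdict

METRIC = "temporal_cloud_v1_resource_exhausted_error_count"

# Matches `name{labels} value [timestamp]`; the label block is optional.
_SAMPLE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
# Matches one `key="value"` pair, allowing backslash escapes in the value.
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def exhausted_by_operation(scrape_text: str) -> dict[str, float]:
    """Sum resource-exhausted error rates per `operation` label value."""
    totals: defaultdict[str, float] = defaultdict(float)
    for line in scrape_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE/EOF comment lines
        match = _SAMPLE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        totals[labels.get("operation", "unknown")] += float(match.group("value"))
    return dict(totals)
```

An operation that stays above zero across many scrapes points to the hot resource. In practice the same grouping is usually done with a query in your observability tool rather than in a script.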

## Monitoring Trends Against Limits {#rps-aps-rate-limits}

Tracking trends against your account limits is the most important throttling signal to monitor. Unlike [Resource Exhaustion](#detecting-resource-exhaustion), which usually self-heals through retries, hitting a limit slows or stalls progress until the workload backs off or your capacity is increased.

The set of [limit metrics](/cloud/metrics/openmetrics/metrics-reference#limit-metrics) provides a time series of values for limits. Use these
metrics with their corresponding count metrics to monitor general trends against limits and set alerts when limits are exceeded. Use the corresponding throttle metrics
to determine the severity of any active rate limiting.
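
The comparison described above can be sketched as a ratio of a count metric to its limit metric. The following is a minimal illustration, assuming Python 3; the function names and the 0.8 warning ratio are assumptions made for this sketch, not Temporal recommendations.

```python
def limit_utilization(count_per_sec: float, limit_per_sec: float) -> float:
    """Return the fraction of the limit currently in use."""
    if limit_per_sec <= 0:
        raise ValueError("limit must be positive")
    return count_per_sec / limit_per_sec


def classify_against_limit(
    count_per_sec: float,
    limit_per_sec: float,
    throttled_per_sec: float = 0.0,
    warn_ratio: float = 0.8,  # assumed threshold; tune for your workload
) -> str:
    """Return 'throttled', 'warning', or 'ok' for one count/limit pair."""
    if throttled_per_sec > 0:
        return "throttled"  # rate limiting is already active
    if limit_utilization(count_per_sec, limit_per_sec) >= warn_ratio:
        return "warning"  # trending toward the limit
    return "ok"
```

The inputs would come from a count metric, its corresponding limit metric, and the matching throttle metric for the same time window. The same logic is normally expressed as an alert rule in your monitoring system.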