Add Cloud metrics Quickstart and clarify integration setup #4577

Open: dustin-temporal wants to merge 4 commits into main from docs/cloud-metrics-quickstart-and-fixes

@dustin-temporal dustin-temporal commented May 15, 2026

Summary

Updates /cloud/metrics/ and /cloud/metrics/openmetrics/ based on feedback from a PM hands-on testing session with the OpenMetrics endpoint and Datadog/Grafana integrations.

Changes:

  • New Quickstart at the top of /cloud/metrics/openmetrics/ with four numbered steps (create Service Account → generate API key → verify endpoint → configure tool).
  • Callout that metrics.temporal.io is a scrape endpoint, not a browser URL. It explains the "Jwt is missing" error that testers hit when they opened the URL directly.
  • Role requirement spelled out: granting the Metrics Read-Only role requires Account Owner or Global Admin (Namespace Admin is not sufficient).
  • /cloud/metrics/ root page now links directly to the Quickstart from the top, so first-time users reach actionable steps in one click.
  • Grafana Cloud section in metrics-integrations.mdx now documents the API key field and the metrics.temporal.io allowed-hosts step inline, instead of bouncing users to Grafana docs that don't cover it.
  • Datadog section now includes a direct link to https://app.datadoghq.com/integrations and concrete setup steps (search the catalog, paste API key, optional namespace filter).
  • API key authentication section now points to the new Quickstart instead of the migration guide. New users shouldn't be reading a v0-to-v1 migration doc to get started.
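
The "verify endpoint" step in the Quickstart can be sketched as a small script. This is a minimal illustration, not the text added to the docs: it assumes the API key is sent as a bearer token in the `Authorization` header and uses the `/v1/metrics` form of the URL shown in the Cloud UI. The function names are hypothetical.

```python
import urllib.request

# Assumed endpoint form. The docs and the Cloud UI currently differ between
# https://metrics.temporal.io/ and https://metrics.temporal.io/v1/metrics.
METRICS_URL = "https://metrics.temporal.io/v1/metrics"


def build_metrics_request(api_key: str, url: str = METRICS_URL) -> urllib.request.Request:
    """Build an authenticated GET request for the scrape endpoint."""
    if not api_key:
        raise ValueError("an API key from the Service Account is required")
    # Assumption: the key is passed as a bearer token. A request without
    # credentials is what a browser sends when the URL is opened directly.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def fetch_metrics(api_key: str, timeout: float = 10.0) -> str:
    """Fetch the metrics payload as text. This performs a network call."""
    request = build_metrics_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metrics` with a key from a Service Account that holds the Metrics Read-Only role should return metrics text rather than an authentication error, which is the check the Quickstart step is meant to give users before they configure a tool.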

Why

PMs hands-on-tested the full flow and surfaced these issues:

  • "The metrics docs do a good job explaining the different concepts, but could be improved to become more outcome oriented. My experience would be streamlined if I just had to do a step by step guide to create service account, get API Key, configure that in Grafana, etc."
  • "I followed the docs and ended up in https://metrics.temporal.io/, but the page just shows 'Jwt is missing' instead of redirecting me to an OAuth flow. I think the docs could clarify this isn't for opening with the browser."
  • "The Grafana integration redirects the user back and forth between Temporal Docs and Grafana Docs. But none of them tell you how to configure the API key and allowed hosts in Grafana UI."
  • "Creating a service account with API keys requires Global Admin or Account owner role permissions. Creating a service account api key as a namespace admin does not allow access to the 'Metrics-read only' permission."
  • "Couldn't find the integration on the datadog docs page (need to add a link to this https://app.datadoghq.com/integrations)."
  • "One of the pages talks about creating a service account, but that takes me to the Migration subpage. As a new-to-metrics user, I don't think of the Migration page as something I need to look at."

Out of scope (filed separately or deferred)

  • The URL inconsistency between https://metrics.temporal.io/ (docs) and https://metrics.temporal.io/v1/metrics (Cloud UI) - deferred until we canonicalize one form.
  • The "Future pricing may apply..." tip on the OpenMetrics page - reviewer didn't understand it. Will follow up separately on what it should say.

Checklist

  • Follows STYLE.md guidelines (admonition syntax matches existing pages)
  • Frontmatter unchanged
  • Internal links use slugs (/cloud/metrics/openmetrics, #quickstart, #prometheus-grafana)
  • Added explicit {#prometheus-grafana} anchor on the heading the Quickstart links to
  • Local yarn build (skipped here; opening as draft for review)

Attachments: EDU-6372 Add Cloud metrics Quickstart and clarify integration setup

- Add a top-of-page Quickstart to /cloud/metrics/openmetrics/ that walks users
  through Service Account creation, API key generation, endpoint verification,
  and tool selection in 4 numbered steps. Calls out the Account Owner / Global
  Admin requirement for granting the Metrics Read-Only role.
- Add a callout explaining that metrics.temporal.io is for scrapers, not
  browsers, since visiting the URL directly returns 'Jwt is missing'.
- Point /cloud/metrics/ root page directly at the Quickstart so new users
  reach actionable steps in one click instead of two.
- Inline the Grafana Cloud setup steps (API key, allowed hosts) so users no
  longer ping-pong between Temporal and Grafana docs to find the basics.
- Add a direct deep link to https://app.datadoghq.com/integrations and the
  Datadog setup steps (search integration, paste API key, optional namespace
  filter) so users do not have to hunt for the integration tile in DD.
- Update the API key authentication section to link to the new Quickstart
  instead of the migration guide, so first-time users no longer have to read
  a v0-to-v1 migration document to get started.

vercel Bot commented May 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | May 15, 2026 8:23pm |

github-actions Bot commented May 15, 2026

📖 Docs PR preview links

ClickStack and New Relic previously had only a one-line description and a
link to the vendor's docs. Add concrete numbered steps so users can configure
both without bouncing between docs sites, matching the pattern now used for
Datadog and Grafana Cloud.

- ClickStack: keep existing intro paragraph; add 5 steps covering the
  temporal.key file, OTel collector config, Docker Compose mounts, HyperDX
  verification, and pre-built dashboard import.
- New Relic: add 4 steps covering infrastructure-agent install, nri-flex
  config file placement and API key substitution, agent restart, and
  pre-built dashboard install. Includes a callout noting the integration
  needs a host running the New Relic infrastructure agent.

Based on feedback from Kevin Woo: the existing Detecting Resource Exhaustion
section covers the concept but is hard to reach from the metrics reference,
and it does not make clear that account-limit throttling is the more important
signal to monitor.

- Rewrite the Detecting Resource Exhaustion section in service-health.mdx to
  explain that exhaustion is a burst signal (gracefully retried), call out
  the operation label as the investigation hook now that resource_exhausted_cause
  is gone, and cross-link to Monitoring Trends Against Limits.
- Add a lead-in paragraph to Monitoring Trends Against Limits explaining
  it is the more important throttling signal and contrasting it with
  resource exhaustion.
- Update the temporal_cloud_v1_resource_exhausted_error_count entry in the
  metrics reference to point users at both the throttle metrics and the
  service-health guidance.
- Add a one-line pointer to Monitoring Trends Against Limits from the three
  throttle metric entries (service_request_throttled_count,
  total_action_throttled_count, operations_throttled_count) so a user who
  first sees throttling in a dashboard can reach the alerting guidance.
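
To illustrate the "operation label as the investigation hook" idea, here is a rough sketch that totals `temporal_cloud_v1_resource_exhausted_error_count` samples per `operation` from scraped metrics text. The sample payload, its values, and the `temporal_namespace` label are invented for the example, and `sum_by_label` is a hypothetical helper, not part of any Temporal tooling.

```python
import re
from collections import defaultdict

# Matches `name{labels} value` or `name value`; a trailing timestamp is ignored.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; label names other than `operation` and all
# values are assumptions made for illustration.
SAMPLE = """\
# hypothetical sample payload
temporal_cloud_v1_resource_exhausted_error_count{temporal_namespace="prod.abc",operation="StartWorkflowExecution"} 2
temporal_cloud_v1_resource_exhausted_error_count{temporal_namespace="prod.abc",operation="PollWorkflowTaskQueue"} 1
temporal_cloud_v1_resource_exhausted_error_count{temporal_namespace="dev.abc",operation="StartWorkflowExecution"} 0.5
"""


def sum_by_label(text: str, metric: str, label: str) -> dict:
    """Sum the values of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        match = SAMPLE_RE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        totals[labels.get(label, "")] += float(match.group(3))
    return dict(totals)


by_operation = sum_by_label(
    SAMPLE, "temporal_cloud_v1_resource_exhausted_error_count", "operation"
)
```

Grouping by `operation` shows which calls are hitting bursts. The same helper could be pointed at the throttle metrics, assuming they carry comparable labels, to see whether account-limit throttling is also occurring.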
@dustin-temporal dustin-temporal marked this pull request as ready for review May 15, 2026 20:42
@dustin-temporal dustin-temporal requested a review from a team as a code owner May 15, 2026 20:42

@kevinawoo kevinawoo left a comment

LGTM! Thanks for adding the resource exhausted vs. hitting limits distinction.
