feat: add datadog-error-monitor skill#336

Draft
tofarr wants to merge 13 commits into main from feat/datadog-error-monitor

@tofarr tofarr commented Jun 12, 2026

Contributor

Summary

Adds a new datadog-error-monitor skill — a cron automation that polls Datadog logs every 15 minutes, maintains a self-evolving regex-based error pattern library, and triggers targeted OpenHands investigation conversations when new or spiking errors are detected.

Draft PR for review and discussion — created by an AI agent (OpenHands) on behalf of @tofarr.


How it works

Every 15 min (cron script — no LLM)
     │
     ▼
Query Datadog ──► Match against known patterns ──► Update run_history
     │
     ├── Unknown logs detected? ─────────────────────────────────────┐
     │                                                               │
     └── Any pattern spiked (count > 3× rolling baseline)? ─────────┤
                                                                     ▼
                                                            Start one OpenHands
                                                          investigation conversation
                                                                     │
                                                    ┌────────────────┘
                                                    ▼
                                          Agent categorizes unknown errors
                                          into named regex patterns, writes
                                          them back to the state file,
                                          investigates root causes in local
                                          codebases, creates PRs if confident,
                                          posts summary to Slack

Token efficiency: The cron script is 100% deterministic — zero LLM calls on quiet runs. A conversation is only started when triggered, and only one conversation runs at a time.
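The deterministic matching step in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/main.py: the function name `classify_logs`, the name-to-regex dictionary shape, and the sample patterns are all assumptions made for the example.

```python
import re
from collections import Counter


def classify_logs(messages, patterns):
    """Count matches per known pattern and collect unmatched messages.

    `patterns` maps a pattern name to a regex string (hypothetical shape;
    the real state file schema may differ).
    """
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    counts = Counter()
    unknown = []
    for msg in messages:
        for name, rx in compiled.items():
            if rx.search(msg):
                counts[name] += 1
                break  # first matching pattern wins
        else:
            # No known pattern matched: this log would trigger the agent.
            unknown.append(msg)
    return counts, unknown


# Illustrative data only.
patterns = {"db_timeout": r"timeout after \d+ms", "oom": r"OutOfMemory"}
logs = [
    "timeout after 500ms",
    "OutOfMemoryError in worker",
    "disk full",
    "timeout after 20ms",
]
counts, unknown = classify_logs(logs, patterns)
```

Because the loop uses only `re`, a quiet run (no unknown logs, no spikes) never needs an LLM call.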


Files

| File | Purpose |
| --- | --- |
| SKILL.md | 11-step setup workflow the agent guides users through |
| scripts/main.py | Cron script template (657 lines, stdlib-only, customised at automation-creation time) |
| references/state-schema.md | JSON state file schema, agent write protocol, and lifecycle notes |
| references/datadog-api.md | Datadog API reference (auth, log search, rate limits) |
| references/agent-prompt-template.md | Investigation prompt structure, token budget, and workspace notes |
| README.md | User-facing overview |

Key design decisions

  • Self-evolving pattern library: Starts empty. The agent builds patterns organically from the first run's uncategorized logs and writes them back to the state file. Future runs match deterministically with no LLM cost.
  • One global conversation at a time: All triggers from a given run are batched into a single investigation conversation. No new conversation starts while one is running.
  • Spike detection: current_count > mean(last 3 runs) × SPIKE_MULTIPLIER — activates only after 3+ history entries to avoid false positives on newly added patterns.
  • Multi-host git support: Repos can span GitHub, GitLab, and Bitbucket. The setup skill verifies each repo path and the corresponding token at configuration time.
  • PR guardrail: The agent prompt explicitly instructs: create a PR only if highly confident of a code-level root cause. Infrastructure/config issues get a Slack note instead.
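The spike rule above can be written as a small pure function. This is a sketch under stated assumptions: `SPIKE_MULTIPLIER = 3.0` is taken from the "3× rolling baseline" in the diagram, `history` is assumed to hold per-run counts from previous runs only (oldest first), and the function name is hypothetical.

```python
from statistics import mean

SPIKE_MULTIPLIER = 3.0  # assumed default, from "3x rolling baseline"
MIN_HISTORY = 3         # spike detection stays off until 3 prior runs exist


def is_spike(current_count, history):
    """Return True if current_count exceeds the rolling baseline.

    `history` is a list of counts from earlier runs, oldest first
    (an assumption; the real run_history layout may differ).
    """
    if len(history) < MIN_HISTORY:
        return False  # newly added pattern: not enough data yet
    baseline = mean(history[-MIN_HISTORY:])
    return current_count > baseline * SPIKE_MULTIPLIER
```

One consequence worth noting for review: with a strict `>` comparison, a pattern whose last three runs were all zero has a baseline of 0, so a single new occurrence counts as a spike.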

Open questions / known gaps for review

  1. Pattern bootstrapping UX — On first run everything is uncategorized. Users may want to run a one-time manual bootstrap query to see what the first agent conversation will receive. Should the SKILL.md include an optional pre-run step that previews a sample of recent errors?

  2. min_cluster_size not yet configurable — Currently hardcoded to 1 (any single unmatched log triggers). Should this be a configurable setup parameter?

  3. Investigation conversation workspace isolation — The agent's workspace is set to the first configured repo path. For multi-repo setups this is a minor limitation. A dedicated investigations directory could be used instead. Opinions?

  4. No .plugin metadata directory — Other skills have a .plugin/ directory with plugin manifests. This PR doesn't include it yet. Should it follow the same pattern?

Screenshots

Skill appears in beta list:
[screenshot]

Setup works as expected:
[screenshot]

Monitors create conversations:
[screenshot]

Initial draft of a cron automation skill that polls Datadog logs,
maintains a regex-based error pattern library, and triggers OpenHands
investigation conversations on new or spiking errors.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions github-actions Bot added the type: feat A new feature label Jun 12, 2026
tofarr and others added 6 commits June 12, 2026 13:58
Fixes test_marketplace_includes_all_skills — every skill with
a SKILL.md must have a corresponding marketplace entry.

Co-authored-by: openhands <openhands@all-hands.dev>
Runs build-skills-catalog.mjs to include datadog-error-monitor
in the generated catalog. Fixes test_index_is_up_to_date.

Co-authored-by: openhands <openhands@all-hands.dev>
- skills/datadog-error-monitor/.plugin/plugin.json (required by test_all_marketplace_skills_have_plugin_json)
- .claude-plugin and .codex-plugin symlinks (required by test_all_marketplace_skills_have_vendor_symlinks)
- README.md catalog regenerated via sync_extensions.py catalog

Co-authored-by: openhands <openhands@all-hands.dev>
Adds automations/catalog/datadog-error-monitor.json and updates
automations/index.js so the automation appears in the agent-canvas
beta automations list alongside linear-triage, standup-digest, etc.

Co-authored-by: openhands <openhands@all-hands.dev>
- integrations/catalog/datadog.json: new HTTP integration entry
  with iconBg #632CA6 (Datadog brand purple)
- automations/catalog/datadog-error-monitor.json: add "datadog"
  as the first requiredIntegrationId so the automation card shows
  the Datadog logo rather than the Slack logo

Datadog is kind: "http" (not mcp) since the automation uses the
Datadog REST API directly via DD_API_KEY / DD_APP_KEY secrets.

Co-authored-by: openhands <openhands@all-hands.dev>
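For reviewers unfamiliar with the direct REST approach, a stdlib-only request against the Datadog v2 log search endpoint might look like the sketch below. The helper name `build_log_search_request`, the `status:error` query, and the default `api.datadoghq.com` site are assumptions for illustration; the actual query and site used by main.py are set at automation-creation time. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_log_search_request(api_key, app_key, query="status:error",
                             site="api.datadoghq.com"):
    """Build (but do not send) a Datadog v2 log search request."""
    body = {
        "filter": {"query": query, "from": "now-15m", "to": "now"},
        "page": {"limit": 1000},
        "sort": "timestamp",
    }
    return urllib.request.Request(
        f"https://{site}/api/v2/logs/events/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "DD-API-KEY": api_key,          # value of the DD_API_KEY secret
            "DD-APPLICATION-KEY": app_key,  # value of the DD_APP_KEY secret
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_log_search_request("example-api-key", "example-app-key")
```

Passing the result to `urllib.request.urlopen(req)` would perform the call; pagination and rate-limit handling are left out of this sketch.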
@malhotra5 malhotra5 self-requested a review June 12, 2026 21:36
tofarr and others added 6 commits June 15, 2026 06:53
…oyment correlation

Implements all discussed improvements to main.py:

* archive_stale_patterns() — patterns not seen in 30 days are moved to
  dd_monitor_{id}_archive.json (separate file, not deleted)
* Pattern schema gains first_seen, total_events, and description fields;
  total_events is incremented on every matched log event
* EXAMPLES_PER_PATTERN lowered 5 → 3
* Investigation prompt restructured into 4 tasks:
  Task 1 — Categorize unknown logs (with deduplication check against existing
           patterns before creating new ones)
  Task 2 — Correlate first_seen against git tags to surface likely deploy
           (with explicit note asking user to confirm their deployment signal)
  Task 3 — Investigate spiking patterns within a hard tool-call budget
           (INVESTIGATION_BUDGET=10); step-by-step with one permitted
           follow-up Datadog query; explicit "declare inconclusive" escape hatch
  Task 4 — Post Slack summary including inconclusive patterns

Co-authored-by: openhands <openhands@all-hands.dev>
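The 30-day archiving rule from this commit could be sketched as below. This is not the body of archive_stale_patterns(): the `last_seen` field, the ISO-8601 timestamp format, and the function name `split_stale_patterns` are assumptions, and writing the archived half to dd_monitor_{id}_archive.json is left out.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)


def split_stale_patterns(patterns, now):
    """Partition patterns into (active, archived) by last-seen age.

    Each pattern is assumed to carry an ISO-8601 `last_seen` string;
    that field name is hypothetical.
    """
    active, archived = {}, {}
    for name, pattern in patterns.items():
        last_seen = datetime.fromisoformat(pattern["last_seen"])
        target = archived if now - last_seen > STALE_AFTER else active
        target[name] = pattern  # moved, never deleted
    return active, archived


# Illustrative data only.
now = datetime(2026, 6, 15, tzinfo=timezone.utc)
patterns = {
    "db_timeout": {"last_seen": "2026-06-14T00:00:00+00:00", "total_events": 12},
    "old_crash": {"last_seen": "2026-05-01T00:00:00+00:00", "total_events": 3},
}
active, archived = split_stale_patterns(patterns, now)
```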
SKILL.md (parameter table + substitution table) and
references/agent-prompt-template.md were still showing the old
default of 5 after the value was changed in main.py.

Co-authored-by: openhands <openhands@all-hands.dev>
SKILL.md content changed (EXAMPLES_PER_PATTERN doc fix);
rerun build-skills-catalog.mjs to keep index in sync.

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
…ey is not masked

Without this header, /api/settings returns the llm.api_key as a redacted
placeholder. That placeholder flows into the spawned conversation payload,
causing LiteLLM to fail with "Missing credentials".

Matches the pattern already used in github-repo-monitor.

Co-authored-by: openhands <openhands@all-hands.dev>
…not permanently blocked

When a conversation fails silently (e.g. due to LiteLLM errors), it can remain
in 'running' state indefinitely. The active-conversation guard then exits early on
every subsequent run, skipping both the unknown-log and spike triggers entirely.

This adds STUCK_CONVERSATION_MINUTES = 45. Any conversation that has been in a
non-terminal state for longer than that is treated as stuck, the active slot is
cleared, and trigger evaluation proceeds normally on the same run.

Co-authored-by: openhands <openhands@all-hands.dev>
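The stuck-conversation guard described in this commit can be illustrated with a small predicate. Only `STUCK_CONVERSATION_MINUTES = 45` comes from the commit message; the status names in `TERMINAL_STATES`, the `started_at` field, and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STUCK_CONVERSATION_MINUTES = 45
TERMINAL_STATES = {"finished", "error", "stopped"}  # hypothetical status names


def active_slot_blocks_run(conversation, now):
    """Return True if an in-flight conversation should block new triggers.

    `conversation` is None or a dict with `status` and an ISO-8601
    `started_at` (assumed fields).
    """
    if conversation is None:
        return False
    if conversation["status"] in TERMINAL_STATES:
        return False
    age = now - datetime.fromisoformat(conversation["started_at"])
    # Older than the threshold: treat as stuck and free the slot.
    return age <= timedelta(minutes=STUCK_CONVERSATION_MINUTES)
```

When the predicate returns False for a stuck conversation, trigger evaluation can continue in the same run rather than waiting for the next 15-minute tick.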