feat: add datadog-error-monitor skill by tofarr · Pull Request #336 · OpenHands/extensions

tofarr · 2026-06-12T16:05:29Z

Summary

Adds a new datadog-error-monitor skill — a cron automation that polls Datadog logs every 15 minutes, maintains a self-evolving regex-based error pattern library, and triggers targeted OpenHands investigation conversations when new or spiking errors are detected.

Draft PR for review and discussion — created by an AI agent (OpenHands) on behalf of @tofarr.

How it works

Every 15 min (cron script — no LLM)
     │
     ▼
Query Datadog ──► Match against known patterns ──► Update run_history
     │
     ├── Unknown logs detected? ─────────────────────────────────────┐
     │                                                               │
     └── Any pattern spiked (count > 3× rolling baseline)? ─────────┤
                                                                     ▼
                                                            Start one OpenHands
                                                          investigation conversation
                                                                     │
                                                    ┌────────────────┘
                                                    ▼
                                          Agent categorizes unknown errors
                                          into named regex patterns, writes
                                          them back to the state file,
                                          investigates root causes in local
                                          codebases, creates PRs if confident,
                                          posts summary to Slack

Token efficiency: The cron script is 100% deterministic — zero LLM calls on quiet runs. A conversation is only started when triggered, and only one conversation runs at a time.
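The deterministic matching step in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/main.py`: the function name `classify_logs`, the name-to-regex mapping, and the first-match-wins rule are all assumptions made for the example.

```python
import re
from collections import Counter


def classify_logs(messages, patterns):
    """Split log messages into per-pattern counts and unmatched messages.

    `patterns` maps a pattern name to a regex string (hypothetical shape).
    The first pattern that matches a message wins.
    """
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    counts = Counter()
    unknown = []
    for msg in messages:
        for name, rx in compiled.items():
            if rx.search(msg):
                counts[name] += 1
                break
        else:
            # No known pattern matched: this log is a candidate trigger.
            unknown.append(msg)
    return counts, unknown


# Example data, invented for illustration only.
patterns = {"db_timeout": r"timeout after \d+ms"}
messages = [
    "db timeout after 500ms",
    "timeout after 1200ms on replica",
    "KeyError: 'user_id'",
]
counts, unknown = classify_logs(messages, patterns)
# A conversation would only be considered when `unknown` is non-empty
# (or when a known pattern spikes); quiet runs need no LLM call.
needs_investigation = bool(unknown)
```

Because the matching is plain regex evaluation, a run where every log matches a known pattern and nothing spikes ends without starting a conversation.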

Files

| File | Purpose |
| --- | --- |
| `SKILL.md` | 11-step setup workflow the agent guides users through |
| `scripts/main.py` | Cron script template (657 lines, stdlib-only, customised at automation-creation time) |
| `references/state-schema.md` | JSON state file schema, agent write protocol, and lifecycle notes |
| `references/datadog-api.md` | Datadog API reference (auth, log search, rate limits) |
| `references/agent-prompt-template.md` | Investigation prompt structure, token budget, and workspace notes |
| `README.md` | User-facing overview |

Key design decisions

- Self-evolving pattern library: Starts empty. The agent builds patterns from the first run's uncategorized logs and writes them back to the state file. Future runs match deterministically with no LLM cost.
- One global conversation at a time: All triggers from a given run are batched into a single investigation conversation. No new conversation starts while one is running.
- Spike detection: A pattern is flagged when `current_count > mean(last 3 runs) × SPIKE_MULTIPLIER`. The check activates only once a pattern has 3+ history entries, to avoid false positives on newly added patterns.
- Multi-host git support: Repos can span GitHub, GitLab, and Bitbucket. The setup skill verifies each repo path and the corresponding token at configuration time.
- PR guardrail: The agent prompt explicitly instructs the agent to create a PR only if it is highly confident of a code-level root cause. Infrastructure/config issues get a Slack note instead.
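As a rough sketch of the spike rule above, assuming each pattern keeps a plain list of per-run counts (the actual `run_history` layout is defined in `references/state-schema.md` and may differ) and that `SPIKE_MULTIPLIER` is 3, as the "3× rolling baseline" in the diagram suggests:

```python
from statistics import mean

SPIKE_MULTIPLIER = 3.0  # assumed default, taken from the "3x" in the diagram
MIN_HISTORY = 3         # the rule only activates with 3+ history entries


def is_spike(current_count, previous_counts, multiplier=SPIKE_MULTIPLIER):
    """Return True if current_count exceeds the rolling baseline.

    `previous_counts` is a hypothetical list of counts from earlier runs,
    oldest first, not including the current run.
    """
    if len(previous_counts) < MIN_HISTORY:
        # Too little history: never flag newly added patterns.
        return False
    baseline = mean(previous_counts[-MIN_HISTORY:])
    return current_count > baseline * multiplier
```

For example, with previous counts of 10, 12, and 8 the baseline is 10, so a count of 40 is flagged while a count of exactly 30 is not. One consequence of this form of the rule is that a baseline of zero flags any non-zero count, which a real implementation might want to guard against.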

Open questions / known gaps for review

1. Pattern bootstrapping UX: On first run everything is uncategorized. Users may want to run a one-time manual bootstrap query to see what the first agent conversation will receive. Should the SKILL.md include an optional pre-run step that previews a sample of recent errors?
2. min_cluster_size not yet configurable: Currently hardcoded to 1 (any single unmatched log triggers). Should this be a configurable setup parameter?
3. Investigation conversation workspace isolation: The agent's workspace is set to the first configured repo path. For multi-repo setups this is a minor limitation. A dedicated investigations directory could be used instead. Opinions?
4. No .plugin metadata directory: Other skills have a .plugin/ directory with plugin manifests. This PR doesn't include it yet. Should it follow the same pattern?

Screenshots

Skill appears in beta list:

Setup works as expected:

Monitors create conversations:

Initial draft of a cron automation skill that polls Datadog logs, maintains a regex-based error pattern library, and triggers OpenHands investigation conversations on new or spiking errors. Co-authored-by: openhands <openhands@all-hands.dev>

Fixes test_marketplace_includes_all_skills — every skill with a SKILL.md must have a corresponding marketplace entry. Co-authored-by: openhands <openhands@all-hands.dev>

Runs build-skills-catalog.mjs to include datadog-error-monitor in the generated catalog. Fixes test_index_is_up_to_date. Co-authored-by: openhands <openhands@all-hands.dev>

- skills/datadog-error-monitor/.plugin/plugin.json (required by test_all_marketplace_skills_have_plugin_json)
- .claude-plugin and .codex-plugin symlinks (required by test_all_marketplace_skills_have_vendor_symlinks)
- README.md catalog regenerated via sync_extensions.py catalog

Co-authored-by: openhands <openhands@all-hands.dev>

Adds automations/catalog/datadog-error-monitor.json and updates automations/index.js so the automation appears in the agent-canvas beta automations list alongside linear-triage, standup-digest, etc. Co-authored-by: openhands <openhands@all-hands.dev>

- integrations/catalog/datadog.json: new HTTP integration entry with iconBg #632CA6 (Datadog brand purple)
- automations/catalog/datadog-error-monitor.json: add "datadog" as the first requiredIntegrationId so the automation card shows the Datadog logo rather than the Slack logo

Datadog is kind: "http" (not mcp) since the automation uses the Datadog REST API directly via DD_API_KEY / DD_APP_KEY secrets.

Co-authored-by: openhands <openhands@all-hands.dev>
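For context, a direct REST call of the kind described above might look like the sketch below, which only builds the request and does not send it. It assumes the Datadog v2 log search endpoint on the US1 site and a 15-minute window; the function name, the query string, and the page limit are illustrative and not taken from `scripts/main.py`.

```python
import json
import os
import urllib.request

DD_SITE = "https://api.datadoghq.com"  # assumption: US1 site; other sites differ


def build_log_search_request(query, api_key, app_key, window="now-15m", limit=100):
    """Build (but do not send) a POST request to the v2 log search endpoint."""
    body = {
        "filter": {"query": query, "from": window, "to": "now"},
        "sort": "timestamp",
        "page": {"limit": limit},
    }
    return urllib.request.Request(
        f"{DD_SITE}/api/v2/logs/events/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )


# The secrets are read from the environment, matching the DD_API_KEY /
# DD_APP_KEY names mentioned in the commit message.
req = build_log_search_request(
    "status:error",
    os.environ.get("DD_API_KEY", ""),
    os.environ.get("DD_APP_KEY", ""),
)
# Sending would be: urllib.request.urlopen(req, timeout=30)
```

Sticking to `urllib.request` keeps the sketch in line with the stdlib-only constraint noted for the cron script.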

…oyment correlation

Implements all discussed improvements to main.py:

* archive_stale_patterns(): patterns not seen in 30 days are moved to dd_monitor_{id}_archive.json (separate file, not deleted)
* Pattern schema gains first_seen, total_events, and description fields; total_events is incremented on every matched log event
* EXAMPLES_PER_PATTERN lowered 5 → 3
* Investigation prompt restructured into 4 tasks:
  * Task 1: Categorize unknown logs (with deduplication check against existing patterns before creating new ones)
  * Task 2: Correlate first_seen against git tags to surface likely deploy (with explicit note asking user to confirm their deployment signal)
  * Task 3: Investigate spiking patterns within a hard tool-call budget (INVESTIGATION_BUDGET=10); step-by-step with one permitted follow-up Datadog query; explicit "declare inconclusive" escape hatch
  * Task 4: Post Slack summary including inconclusive patterns

Co-authored-by: openhands <openhands@all-hands.dev>
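The archiving step described in this commit could be sketched as below. This is a guess at the shape, not the actual `archive_stale_patterns()`: the `last_seen` field and its ISO 8601 format are assumptions (the commit only names `first_seen`, `total_events`, and `description`), and the sketch returns the two groups instead of writing the archive file.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER_DAYS = 30  # from the commit message: not seen in 30 days


def split_stale_patterns(patterns, now, stale_after_days=STALE_AFTER_DAYS):
    """Partition patterns into (active, archived) by a last-seen timestamp.

    `patterns` maps pattern name -> dict with a hypothetical "last_seen"
    ISO 8601 timestamp. Nothing is deleted; stale entries are returned
    separately so the caller can write them to an archive file.
    """
    cutoff = now - timedelta(days=stale_after_days)
    active, archived = {}, {}
    for name, pattern in patterns.items():
        last_seen = datetime.fromisoformat(pattern["last_seen"])
        target = archived if last_seen < cutoff else active
        target[name] = pattern
    return active, archived
```

A caller would then merge `archived` into the separate archive file and write `active` back to the main state file, so stale patterns stop costing match time but remain recoverable.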

SKILL.md (parameter table + substitution table) and references/agent-prompt-template.md were still showing the old default of 5 after the value was changed in main.py. Co-authored-by: openhands <openhands@all-hands.dev>

SKILL.md content changed (EXAMPLES_PER_PATTERN doc fix); rerun build-skills-catalog.mjs to keep index in sync. Co-authored-by: openhands <openhands@all-hands.dev>

Co-authored-by: openhands <openhands@all-hands.dev>

…ey is not masked

Without this header, /api/settings returns the llm.api_key as a redacted placeholder. That placeholder flows into the spawned conversation payload, causing LiteLLM to fail with "Missing credentials". Matches the pattern already used in github-repo-monitor.

Co-authored-by: openhands <openhands@all-hands.dev>

…not permanently blocked

When a conversation fails silently (e.g. due to LiteLLM errors), it can remain in 'running' state indefinitely. The active-conversation guard then exits early on every subsequent run, skipping both the unknown-log and spike triggers entirely.

This adds STUCK_CONVERSATION_MINUTES = 45. Any conversation that has been in a non-terminal state for longer than that is treated as stuck, the active slot is cleared, and trigger evaluation proceeds normally on the same run.

Co-authored-by: openhands <openhands@all-hands.dev>
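A minimal sketch of the stuck-conversation guard described in this commit follows. The conversation record shape (`status`, `started_at`) and the set of terminal states are hypothetical; only `STUCK_CONVERSATION_MINUTES = 45` comes from the commit message.

```python
from datetime import datetime, timedelta, timezone

STUCK_CONVERSATION_MINUTES = 45  # value from the commit message
TERMINAL_STATES = {"finished", "error", "stopped"}  # hypothetical state names


def active_slot_is_blocking(conversation, now):
    """Return True if a tracked conversation should block new triggers.

    `conversation` is None or a dict with hypothetical keys "status" and
    "started_at" (ISO 8601). A non-terminal conversation older than
    STUCK_CONVERSATION_MINUTES is treated as stuck and does not block.
    """
    if conversation is None:
        return False
    if conversation["status"] in TERMINAL_STATES:
        return False
    started = datetime.fromisoformat(conversation["started_at"])
    age = now - started
    return age <= timedelta(minutes=STUCK_CONVERSATION_MINUTES)
```

When this returns False for a non-terminal conversation, the caller would clear the active slot and continue with trigger evaluation in the same run, as the commit describes.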

feat: add datadog-error-monitor skill

f6d7fbe

github-actions bot added the `type: feat` label (A new feature) on Jun 12, 2026

tofarr and others added 6 commits June 12, 2026 13:58

python scripts/sync_extensions.py --check

c6b15c6

fix: add datadog-error-monitor to marketplace

b92a93a

chore: regenerate skills/index.js

2641fd2

malhotra5 self-requested a review June 12, 2026 21:36

tofarr and others added 6 commits June 15, 2026 06:53

fix: update EXAMPLES_PER_PATTERN default from 5 to 3 in docs

cbb6c27

chore: regenerate skills/index.js

61a84fe

docs: note skills/index.js regeneration rule in AGENTS.md

84a13cc

tofarr wants to merge 13 commits into main from feat/datadog-error-monitor
