From aea42ac4c2cbf5173a390deaf85b25a32438ac2d Mon Sep 17 00:00:00 2001
From: Xav Paice <xav@replicated.com>
Date: Mon, 22 Jun 2026 09:46:46 +1200
Subject: [PATCH 1/5] add opencode skill for testgrid

---
 .envrc                                        |   1 +
 .gitignore                                    |   4 +-
 .opencode/.gitignore                          |   5 +
 .../skills/testgrid-failure-analysis/SKILL.md |  78 +++++
 .../skills/testgrid-failure-analysis/fetch.py | 202 +++++++++++++
 AGENTS.md                                     | 281 ++++++++++++++++++
 6 files changed, 569 insertions(+), 2 deletions(-)
 create mode 100644 .envrc
 create mode 100644 .opencode/.gitignore
 create mode 100644 .opencode/skills/testgrid-failure-analysis/SKILL.md
 create mode 100755 .opencode/skills/testgrid-failure-analysis/fetch.py
 create mode 100644 AGENTS.md
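Reviewer note (a sketch, not part of the patch): the bundle-URL pattern below is copied from `fetch.py`, while the sample log text, the bucket name, and the `find_bundle_urls` helper are made up for illustration. It suggests that the pattern picks up virtual-hosted `*.s3.amazonaws.com` URLs, drops any query string after `bundle.tgz.age`, and skips regional hostnames.

```python
import re

# Pattern copied from fetch.py in this patch.
BUNDLE_RE = re.compile(
    r"https?://[^\s\"]+?\.s3\.amazonaws\.com/[^\s\"]+?bundle\.tgz\.age")

# Hypothetical node-log excerpt; real Testgrid output may look different.
SAMPLE_LOG = (
    "uploaded https://example-bucket.s3.amazonaws.com/abc-1700000000/bundle.tgz.age\n"
    "signed https://example-bucket.s3.amazonaws.com/abc-1700000000/bundle.tgz.age?X-Amz-Signature=deadbeef\n"
    "regional https://example-bucket.s3.us-east-1.amazonaws.com/abc-1700000000/bundle.tgz.age\n"
)


def find_bundle_urls(log_text):
    """Return the unique bundle URLs in log_text, sorted as fetch.py does."""
    return sorted(set(BUNDLE_RE.findall(log_text)))


if __name__ == "__main__":
    for url in find_bundle_urls(SAMPLE_LOG):
        print(url)
```

If Testgrid ever prints presigned or regional URLs, the pattern would likely need widening; this patch does not confirm whether it does.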

diff --git a/.envrc b/.envrc
new file mode 100644
index 0000000000..57ccbaff3c
--- /dev/null
+++ b/.envrc
@@ -0,0 +1 @@
+source .env
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 02438d2b42..c7ddb7b545 100644
--- a/.gitignore
+++ b/.gitignore
@@ -36,11 +36,11 @@ sbom/
 .aider*
 .continue
 .codeium
-CLAUDE.md
-AGENTS.md
 .repomixignore
 
 # local workflow
 plans/
 docs/
 release-manifest.json
+
+.env
diff --git a/.opencode/.gitignore b/.opencode/.gitignore
new file mode 100644
index 0000000000..2ed394f164
--- /dev/null
+++ b/.opencode/.gitignore
@@ -0,0 +1,5 @@
+node_modules
+package.json
+package-lock.json
+bun.lock
+.gitignore
\ No newline at end of file
diff --git a/.opencode/skills/testgrid-failure-analysis/SKILL.md b/.opencode/skills/testgrid-failure-analysis/SKILL.md
new file mode 100644
index 0000000000..9746239c98
--- /dev/null
+++ b/.opencode/skills/testgrid-failure-analysis/SKILL.md
@@ -0,0 +1,78 @@
+---
+name: testgrid-failure-analysis
+description: Use when analyzing a failed Testgrid kURL run to fetch the run results, failure logs, and encrypted support bundles from the Testgrid API and write them into a directory for offline analysis; trigger with "testgrid failure analysis", "fetch testgrid logs", "get support bundle from Testgrid", or "analyze Testgrid run".
+---
+
+# Testgrid failure analysis
+
+This skill helps an agent collect the artifacts of a failed [Testgrid](https://testgrid.kurl.sh/) run so they can be analyzed locally.
+
+## What it does
+
+1. Queries the Testgrid API for a run by `refId`.
+2. Identifies every failed instance (`isSuccess == false`, not unsupported, not skipped, and finished).
+3. For each failure, fetches:
+   - The instance metadata (`instance.json`)
+   - The main instance logs (populated when the VM fails to start)
+   - Sonobuoy results, if any
+   - The per-node logs from the actual test VMs (`{nodeId}.log.txt`)
+   - Any encrypted support bundles whose S3 URLs are printed in the node logs
+4. Writes everything into a structured output directory ready for an agent to inspect.
+
+## Important details from the codebase
+
+- Public API base path is `/api/v1`. The endpoints used are:
+  - `POST /api/v1/run/{refId}` — returns the run with its `instances` array, plus `success_count` and `failure_count`.
+  - `GET /api/v1/instance/{instanceId}/logs` — returns `{"logs": "..."}` from the `testinstance.output` column.
+  - `GET /api/v1/instance/{nodeId}/node-logs` — returns `{"logs": "..."}` from the `clusternode.output` column.
+  - `GET /api/v1/instance/{instanceId}/sonobuoy` — returns `{"results": "..."}`.
+- The open-source `/api/v1` endpoints are **not** authenticated by default (the `api-token` auth middleware only protects the runner endpoints under `/v1`). However, an optional `--api-token` is accepted and sent as HTTP Basic Auth with username `token` and the token as the password, for deployments that add authentication. `--api-key` is kept as a deprecated alias for backward compatibility.
+- Support bundles are collected by the test script (`tgrun/pkg/runner/vmi/embed/runcmd.sh` → `collect_support_bundle`) and uploaded to S3 with the handler at `POST /v1/instance/{instanceId}/bundle`. The S3 URL is printed in the node log output, which is why this skill scans the logs for it.
+- The bundle is encrypted with the `age` file format using a scrypt passphrase. The API stores it with key pattern `{instanceId}-{unix}/bundle.tgz.age`. The downloaded file keeps the `.age` extension.
+- If you provide the age passphrase, the helper script will try to decrypt each bundle with the `age` CLI, writing the decrypted `.tgz` next to the encrypted `.tgz.age` file.
+
+## Node IDs used by the runner
+
+Testgrid creates one initial-primary node plus optional additional nodes. The node IDs are predictable from the instance ID and the `numPrimaryNodes` / `numSecondaryNodes` fields, so the skill tries:
+
+- `{instanceId}-initialprimary`
+- `{instanceId}-primary-1` ... `{instanceId}-primary-{numPrimaryNodes-1}`
+- `{instanceId}-secondary-0` ... `{instanceId}-secondary-{numSecondaryNodes-1}`
+
+Only nodes that actually produced logs will be saved.
+
+## How to use
+
+Run the helper script shipped with this skill:
+
+```bash
+python3 .opencode/skills/testgrid-failure-analysis/fetch.py \
+  --api-endpoint https://api.testgrid.kurl.sh \
+  --ref-id <RUN_REF_ID> \
+  --output-dir ./testgrid-analysis/<RUN_REF_ID> \
+  [--api-token <TOKEN>] \
+  [--age-passphrase <PASSPHRASE>]
+```
+
+Environment variables are also supported:
+
+- `TESTGRID_API_TOKEN` → `--api-token` (`TESTGRID_API_KEY` is still read as a fallback)
+- `TESTGRID_AGE_PASSPHRASE` → `--age-passphrase`
+
+## Output layout
+
+```
+<output-dir>/
+  run.json                    # full run response
+  <instanceId>/
+    instance.json             # instance metadata
+    logs.txt                  # main instance output, if any
+    sonobuoy.txt              # sonobuoy results, if any
+    <instanceId>-initialprimary.log.txt
+    bundle-<nodeId>-0.tgz.age # encrypted support bundle
+    bundle-<nodeId>-0.tgz     # decrypted support bundle (if passphrase supplied)
+```
+
+## What to do next
+
+After fetching, read the `run.json` summary, open the per-instance logs, and inspect any decrypted support bundles. If a bundle could not be downloaded, grep the corresponding node log for `bundle.tgz.age` to find the raw S3 URL.
diff --git a/.opencode/skills/testgrid-failure-analysis/fetch.py b/.opencode/skills/testgrid-failure-analysis/fetch.py
new file mode 100755
index 0000000000..6af39bddb4
--- /dev/null
+++ b/.opencode/skills/testgrid-failure-analysis/fetch.py
@@ -0,0 +1,202 @@
+#!/usr/bin/env python3
+"""Fetch Testgrid failure logs and support bundles for analysis.
+
+Queries the public Testgrid API endpoints:
+  GET /api/v1/runs                         (listed for reference; not called by this script)
+  POST /api/v1/run/{refId}
+  GET /api/v1/instance/{id}/logs
+  GET /api/v1/instance/{nodeId}/node-logs
+  GET /api/v1/instance/{id}/sonobuoy
+
+For every failed instance in a run, it downloads the instance logs and the per-node
+logs, scans them for encrypted support-bundle URLs, downloads the bundles, and
+optionally decrypts them with age when an age passphrase is supplied.
+"""
+
+import argparse
+import base64
+import json
+import os
+import re
+import subprocess
+import sys
+from urllib.request import Request, urlopen
+from urllib.error import HTTPError
+
+
+def api_request(url, api_key=None, data=None, method=None, timeout=60):
+    headers = {"Accept": "application/json"}
+    if api_key:
+        creds = base64.b64encode(b"token:" + api_key.encode()).decode()
+        headers["Authorization"] = f"Basic {creds}"
+    if data is not None:
+        method = method or "POST"
+        headers["Content-Type"] = "application/json"
+    req = Request(url, data=data, headers=headers, method=method)
+    with urlopen(req, timeout=timeout) as resp:
+        return resp.read().decode("utf-8")
+
+
+def download(url, path, timeout=300):
+    req = Request(url, headers={"User-Agent": "testgrid-failure-analysis"})
+    with urlopen(req, timeout=timeout) as resp:
+        with open(path, "wb") as f:
+            f.write(resp.read())
+
+
+def node_ids(instance_id, num_primary, num_secondary):
+    """Return the node IDs the runner creates for a given instance."""
+    ids = [f"{instance_id}-initialprimary"]
+    for i in range(1, max(1, num_primary)):
+        ids.append(f"{instance_id}-primary-{i}")
+    for i in range(num_secondary):
+        ids.append(f"{instance_id}-secondary-{i}")
+    return ids
+
+
+def save_json(path, obj):
+    with open(path, "w") as f:
+        json.dump(obj, f, indent=2)
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Fetch Testgrid run failures, logs, and support bundles.")
+    parser.add_argument(
+        "--api-endpoint", required=True,
+        help="Testgrid API base URL, e.g. https://api.testgrid.kurl.sh")
+    parser.add_argument(
+        "--ref-id", required=True,
+        help="Testgrid run refId (the run identifier shown in the UI)")
+    parser.add_argument(
+        "--output-dir", required=True,
+        help="Directory where artifacts will be written")
+    parser.add_argument(
+        "--api-token", "--api-key", dest="api_token",
+        default=os.environ.get("TESTGRID_API_TOKEN") or os.environ.get("TESTGRID_API_KEY"),
+        help="Optional API token for servers that require one; sent as the HTTP Basic Auth password with username 'token'. Falls back to TESTGRID_API_TOKEN (preferred) or TESTGRID_API_KEY (legacy) if not provided.")
+    parser.add_argument(
+        "--age-passphrase", default=os.environ.get("TESTGRID_AGE_PASSPHRASE"),
+        help="Optional age passphrase to decrypt downloaded .age support bundles")
+    parser.add_argument(
+        "--age-bin", default="age",
+        help="Path to the age decryption binary")
+    parser.add_argument(
+        "--page-size", type=int, default=1000,
+        help="Page size for the run query")
+    args = parser.parse_args()
+
+    base = args.api_endpoint.rstrip("/")
+    if not base.endswith("/api/v1"):
+        base = base + "/api/v1"
+
+    out_dir = args.output_dir
+    os.makedirs(out_dir, exist_ok=True)
+
+    run_url = f"{base}/run/{args.ref_id}"
+    print(f"Fetching run {args.ref_id} ...")
+    run_data = json.loads(api_request(
+        run_url, args.api_token,
+        data=json.dumps({"pageSize": args.page_size}).encode(),
+        timeout=120))
+
+    save_json(os.path.join(out_dir, "run.json"), run_data)
+
+    instances = run_data.get("instances", [])
+    failures = [
+        i for i in instances
+        if not i.get("isSuccess")
+        and not i.get("isUnsupported")
+        and not i.get("isSkipped")
+        and i.get("finishedAt")
+    ]
+
+    print(f"Found {len(failures)} failed instance(s) out of {len(instances)} total.")
+    if not failures:
+        return
+
+    bundle_re = re.compile(
+        r"https?://[^\s\"]+?\.s3\.amazonaws\.com/[^\s\"]+?bundle\.tgz\.age")
+
+    for inst in failures:
+        iid = inst["id"]
+        inst_dir = os.path.join(out_dir, iid)
+        os.makedirs(inst_dir, exist_ok=True)
+        save_json(os.path.join(inst_dir, "instance.json"), inst)
+
+        os_info = f"{inst.get('osName', '?')} {inst.get('osVersion', '?')}"
+        print(f"\n{iid} ({os_info}, reason: {inst.get('failureReason', 'unknown')})")
+
+        # Main instance output (usually only populated on VMI startup failures).
+        try:
+            txt = api_request(f"{base}/instance/{iid}/logs", args.api_token, timeout=60)
+            logs = json.loads(txt).get("logs", "")
+            if logs:
+                with open(os.path.join(inst_dir, "logs.txt"), "w") as f:
+                    f.write(logs)
+                print("  saved instance logs")
+        except HTTPError as e:
+            if e.code != 404:
+                print(f"  warning: could not fetch instance logs: {e}", file=sys.stderr)
+
+        # Sonobuoy results, if present.
+        try:
+            txt = api_request(f"{base}/instance/{iid}/sonobuoy", args.api_token, timeout=60)
+            results = json.loads(txt).get("results", "")
+            if results:
+                with open(os.path.join(inst_dir, "sonobuoy.txt"), "w") as f:
+                    f.write(results)
+                print("  saved sonobuoy results")
+        except HTTPError as e:
+            if e.code != 404:
+                print(f"  warning: could not fetch sonobuoy results: {e}", file=sys.stderr)
+
+        # Per-node logs and support bundles.
+        num_primary = inst.get("numPrimaryNodes", 1)
+        num_secondary = inst.get("numSecondaryNodes", 0)
+        for node_id in node_ids(iid, num_primary, num_secondary):
+            try:
+                txt = api_request(f"{base}/instance/{node_id}/node-logs", args.api_token, timeout=60)
+                logs = json.loads(txt).get("logs", "")
+                if not logs:
+                    continue
+
+                log_path = os.path.join(inst_dir, f"{node_id}.log.txt")
+                with open(log_path, "w") as f:
+                    f.write(logs)
+                print(f"  saved {node_id} node logs")
+
+                urls = sorted(set(bundle_re.findall(logs)))
+                for j, url in enumerate(urls):
+                    enc_path = os.path.join(inst_dir, f"bundle-{node_id}-{j}.tgz.age")
+                    try:
+                        download(url, enc_path)
+                        print(f"  downloaded support bundle -> {enc_path}")
+                    except Exception as e:
+                        print(f"  failed to download bundle {url}: {e}", file=sys.stderr)
+                        continue
+
+                    if args.age_passphrase:
+                        dec_path = os.path.join(inst_dir, f"bundle-{node_id}-{j}.tgz")
+                        try:
+                            with open(dec_path, "wb") as dec_file:
+                                subprocess.run(
+                                    [args.age_bin, "-d", enc_path],  # age detects passphrase-encrypted files itself
+                                    input=args.age_passphrase.encode(),
+                                    stdout=dec_file,
+                                    check=True,
+                                    stderr=subprocess.PIPE,
+                                )
+                            print(f"  decrypted support bundle -> {dec_path}")
+                        except Exception as e:
+                            print(f"  failed to decrypt {enc_path}: {e}", file=sys.stderr)
+
+            except HTTPError as e:
+                if e.code != 404:
+                    print(f"  warning: could not fetch {node_id} node logs: {e}", file=sys.stderr)
+
+    print(f"\nArtifacts written to: {out_dir}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..87d1c15281
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,281 @@
+# Agent Guide for kURL
+
+This guide is for AI agents working in the kURL repository. It explains how the project is organized, how changes are made (especially add-on updates), how to test, and how the repository relates to the kURL-testgrid testing platform.
+
+## 1. Project overview
+
+**kURL** is a Kubernetes installer for air-gapped and online clusters, maintained by Replicated. It automates the tasks a cluster administrator must perform before and after running `kubeadm init` to create a production-ready Kubernetes cluster.
+
+A user posts a YAML `Installer` manifest to the kURL.sh API and receives a deterministic hash. That hash can be used to fetch an install script (`https://kurl.sh/<hash>`) or an air-gap bundle (`https://kurl.sh/bundle/<hash>.tar.gz`). The installer then downloads pre-built add-on tarballs and host packages from the `kurl-sh` S3 bucket and executes the install/upgrade on the target node.
+
+The kURL project has two main repositories:
+
+- **kURL** (`/Users/xav/go/src/github.com/replicatedhq/kURL`) — builds the installer scripts, add-on packages, host packages, Go utilities, and the public API registry.
+- **kURL-testgrid** (`/Users/xav/go/src/github.com/replicatedhq/kURL-testgrid`) — the test automation platform that provisions real Linux VMs, runs kURL installers, executes Sonobuoy conformance tests, and publishes results at `https://testgrid.kurl.sh`.
+
+## 2. Repository structure
+
+Key directories in this repo:
+
+| Directory | Purpose |
+|---|---|
+| `addons/<name>/<version>/` | Add-on definitions, one directory per version. Each version contains at least `Manifest` and `install.sh`, and usually `host-preflight.yaml`. |
+| `bin/` | Build, packaging, and release helper scripts. |
+| `bundles/` | Docker build contexts for host packages (Kubernetes RPM/DEB repos per OS). |
+| `cmd/` | Go entrypoints; `cmd/kurl` is the main CLI binary. |
+| `hack/` | Local development helpers, test data, and test Dockerfiles. |
+| `kurl_util/` | Go utilities built into binaries and the `replicated/kurl-util` Docker image. |
+| `packages/` | Host-package definitions (e.g., `kubernetes`, `host/openssl`, `host/fio`). |
+| `pkg/` | Go library code for the `kurl` CLI. |
+| `scripts/` | The bash installer source (`install.sh`, `join.sh`, `upgrade.sh`, `tasks.sh`, `common/`). |
+| `testgrid/specs/` | Testgrid YAML specs for OS images and test scenarios. |
+| `tools/` | Additional tooling. |
+| `web/src/installers/` | Frontend version registry; `versions.js` lists available add-on versions. |
+| `.github/workflows/` | CI/CD workflows. |
+
+Important documents:
+
+- `README.md` — project overview and community links.
+- `ARCHITECTURE.md` — manifest format, API services, object storage, add-on lifecycle, release workflows.
+- `CONTRIBUTING.md` — development workflow, remote testing, environment setup.
+- `addons/README.md` — add-on structure, Manifest directives, lifecycle hooks.
+- `testgrid/specs/README.md` — how Testgrid specs work.
+- `docs/arch/adr-003-external-addons.md` — external add-on model (`kotsadm`).
+- `CODEOWNERS` — global owner `@replicatedhq/embedded-kubernetes`.
+
+## 3. How to approach changes
+
+### General workflow
+
+1. **Identify the scope.** Is this a core installer change, an add-on change, a Testgrid-only change, or a version bump? Core scripts live in `scripts/`. Add-ons live in `addons/<name>/<version>/`. Testgrid specs live in `testgrid/specs/` and in add-on `template/testgrid/` directories.
+2. **Make the minimal change.** Edit only the source files. Generated files (e.g., `build/`, `dist/`, `addons-gen.json`, `supported-versions-gen.json`) are produced by `make` targets and should not be hand-edited.
+3. **Keep `scripts/Manifest` in sync with `hack/testdata/manifest/clean`.** `make test` compares these two files and fails if they differ. If you modify `scripts/Manifest`, update the test data copy as well.
+4. **Run local tests.** See the Testing section below.
+5. **Let CI run Testgrid for add-on changes.** The `test-addon-pr.yaml` workflow detects modified add-ons and queues Testgrid runs automatically.
+
+### When working on a remote Linux VM
+
+The install scripts are not macOS-compatible. For real-world testing, use `make watchrsync` after exporting:
+
+```bash
+export GOOS=linux
+export GOARCH=amd64
+export DOCKER_DEFAULT_PLATFORM=linux/amd64
+export REMOTES="USER@TARGET_SERVER_IP"
+make watchrsync
+```
+
+`bin/watchrsync.js` continuously syncs local builds to `~/kurl` on the remote server.
+
+### macOS local-build prerequisites
+
+Install `gnu-sed` and `md5sha1sum` (e.g., via Homebrew) to run local build scripts. Apple Silicon hosts should also set `GOOS=linux`, `GOARCH=amd64`, and `DOCKER_DEFAULT_PLATFORM=linux/amd64` before building.
+
+## 4. How to update add-ons
+
+Add-ons are the primary extension point of kURL. Each add-on is a versioned directory with a declarative `Manifest`, an `install.sh` shell script, and optional templates.
+
+### Add-on structure
+
+```text
+addons/<name>/<version>/
+  Manifest                # assets, images, host packages to download
+  install.sh              # defines a function named exactly like the add-on
+  host-preflight.yaml     # optional Troubleshoot.sh preflight spec
+  assets/                 # generated at build time
+  images/                 # generated at build time
+  ...
+```
+
+The `install.sh` must define a function with the exact add-on name (e.g., `function containerd()`). Lifecycle hooks are defined in `addons/README.md`:
+
+- `addon_fetch`
+- `addon_load`
+- `addon_preflights`
+- `addon_pre_init`
+- `addon_install` (required)
+- `addon_already_applied`
+- `addon_join`
+- `addon_post_init`
+- `addon_outro`
+
+### Manifest directives
+
+Common directives (see `addons/README.md` for the full list):
+
+```text
+image pause k8s.gcr.io/pause:3.6
+yum libzstd
+yum8 container-selinux
+yumol <pkg>
+apt containerd
+apt24 containerd
+yum2023 containerd
+asset runc https://github.com/opencontainers/runc/releases/download/v1.3.5/runc.amd64
+dockerout rhel-9 addons/containerd/template/Dockerfile.rhel9 1.7.29
+```
+
+### Adding a new add-on version
+
+1. Create or generate the version directory under `addons/<name>/<version>/`.
+2. Ensure it contains `Manifest`, `install.sh`, and `host-preflight.yaml`.
+3. Add or update Testgrid specs under `addons/<name>/template/testgrid/*.yaml` if needed.
+4. Register the new version in `web/src/installers/versions.js`.
+5. Run `make generate-addons` to update `addons-gen.json` and `supported-versions-gen.json`.
+6. Build the package locally: `make dist/<name>-<version>.tar.gz`.
+7. Test with Testgrid or via `make watchrsync` on a remote Linux VM.
+
+### Templated add-ons
+
+Some add-ons are auto-generated from a `template/` directory:
+
+- `addons/containerd/template/` — `script.sh`, Dockerfiles, and testgrid specs.
+- `addons/flannel/template/` — `generate.sh`, `base/`, and testgrid specs.
+
+For templated add-ons, changes should be made in the template, then the version is regenerated. Do not hand-edit generated version directories.
+
+### External add-ons
+
+`kotsadm` is an external add-on (see `docs/arch/adr-003-external-addons.md`). It is built and released from `replicatedhq/kots`, publishes a `versions.json`, and the `import-external-addons` action copies packages into the kURL S3 bucket. The kURL API merges these with internal versions.
+
+## 5. Testing
+
+### Local tests
+
+```bash
+# Lint, vet, Go tests, and manifest check
+make test
+
+# Go tests only
+go test ./cmd/... ./pkg/...
+make -C kurl_util test
+
+# Shell tests in Docker
+make docker-test-shell
+
+# Containerd configure/upgrade regression tests
+make docker-test-containerd
+
+# Build a specific add-on package
+make dist/containerd-1.7.29.tar.gz
+
+# Build Kubernetes host packages for a specific OS
+make build/packages/kubernetes/1.31.14/ubuntu-22.04
+make build/packages/kubernetes/1.31.14/images
+```
+
+### Testgrid integration
+
+Testgrid is the primary end-to-end testing platform. It is a separate repository (`kURL-testgrid`), but the specs that drive it live in this repo:
+
+- `testgrid/specs/os-*.yaml` — OS image pools.
+- `testgrid/specs/deploy.yaml`, `full.yaml`, `latest.yaml`, `storage-migration.yaml`, `customer-migration-specs.yaml`, `k8s-upgrade.yaml` — test scenarios.
+
+Active CI usage is documented in `testgrid/specs/README.md`. Workflows submit runs via the `replicated/tgrun` Docker image:
+
+```bash
+tgrun queue --spec <spec> --os-spec <os-spec> --ref <ref> --api-token <token>
+```
+
+Add-ons can define their own specs under `addons/<name>/template/testgrid/*.yaml`. During add-on tests, `bin/test-addon.sh` substitutes `__testver__` and `__testdist__` placeholders in the spec templates.
+
+### How Testgrid works (high-level)
+
+1. A user or CI invokes `tgrun queue` with a test spec and OS spec.
+2. `tgrun` submits each installer spec to the kURL API to get a runnable URL/hash.
+3. `tgrun` enqueues planned VM instances to the TGAPI (Testgrid API).
+4. `tgrun run` (the runner daemon) polls TGAPI, creates KubeVirt VMs on bare-metal hosts, and runs the test scripts.
+5. VMs report status, logs, Sonobuoy results, and support bundles back to TGAPI.
+6. The web UI reads TGAPI and displays results at `https://testgrid.kurl.sh`.
+
+Key Testgrid API endpoints for public data:
+
+- `POST /api/v1/run/{refId}` — get a run with its instances.
+- `GET /api/v1/instance/{instanceId}/logs` — main instance logs.
+- `GET /api/v1/instance/{nodeId}/node-logs` — per-node logs.
+- `GET /api/v1/instance/{instanceId}/sonobuoy` — Sonobuoy results.
+
+## 6. Build and release
+
+### Primary build targets
+
+```bash
+make build/install.sh      # single-file installer script
+make build/join.sh         # single-file join script
+make build/upgrade.sh      # single-file upgrade script
+make build/tasks.sh        # single-file tasks script
+make dist/<addon>-<version>.tar.gz
+make dist/common.tar.gz
+make dist/kurl-bin-utils-<version>.tar.gz
+make build/bin/kurl        # kurl CLI
+make kurl-util-image       # replicated/kurl-util:alpha
+```
+
+### Script assembly
+
+`build/install.sh` is assembled from `scripts/install.sh` by inlining every file referenced by `. $DIR/scripts/...` between the `# Magic begin` and `# Magic end` markers. The same pattern applies to `join.sh`, `upgrade.sh`, and `tasks.sh`.
+
+### CI/CD workflows
+
+- `.github/workflows/build-test.yaml` — runs on every PR (Go mod tidy, kurl_util tests, kurl build, shell tests, containerd tests).
+- `.github/workflows/deploy-staging.yaml` — runs on every merge to `main`. Builds packages, uploads to `s3://kurl-sh/staging/<version>-<sha>/`, generates `addons-gen.json`/`supported-versions-gen.json`, and queues Testgrid.
+- `.github/workflows/deploy-prod.yaml` — triggered by tags `v*.*.*`. Copies staging packages to `s3://kurl-sh/dist/<version>/`, creates a GitHub release, generates SBOMs, and queues Testgrid.
+- `.github/workflows/test-addon-pr.yaml` — detects modified add-ons in PRs and invokes `test-addon.yaml` for each.
+- `.github/workflows/test-addon.yaml` — builds a single add-on package, uploads to S3, and queues Testgrid using the add-on's template specs.
+- `.github/workflows/update-<addon>.yaml` — scheduled workflows that generate PRs for new upstream add-on versions (e.g., `update-containerd.yaml`, `update-flannel.yaml`).
+
+### Release versioning
+
+- Production tags: `vYYYY.MM.DD-#` (e.g., `v2024.07.02-0`).
+- Staging versions: `<latest-tag>-<short-sha>` (e.g., `v2024.07.02-0-5af497c`).
+- `make tag-and-release` creates a production tag and triggers the release workflow.
+- The current release is advertised by `s3://kurl-sh/dist/VERSION` and `s3://kurl-sh/staging/VERSION`.
+
+## 7. Conventions and gotchas
+
+- **Add-on directory names** are lowercase (`containerd`, `flannel`, `rook`).
+- **Version directories** use the upstream version string (`1.7.29`, `2.8.1`).
+- **All shell functions in an add-on** should be prefixed with the add-on name to avoid collisions (e.g., `containerd_configure`).
+- **`versions.js`** is the source of truth for selectable add-on versions. The generated `addons-gen.json` and `supported-versions-gen.json` are produced by `make generate-addons`.
+- **`scripts/Manifest`** must stay in sync with `hack/testdata/manifest/clean`; `make test` enforces this.
+- **Generated version directories** should not be hand-edited for templated add-ons; edit the template and regenerate.
+- **Testgrid OS filtering is opt-out** via `unsupportedOSIDs`; there is no `supportedOSIDs`.
+- **Linux/amd64 required** for runtime testing. The scripts are not macOS-compatible; use a remote Linux VM or Docker.
+- **On Apple Silicon**, set `GOOS=linux`, `GOARCH=amd64`, and `DOCKER_DEFAULT_PLATFORM=linux/amd64` before building.
+
+## 8. Useful commands
+
+```bash
+# Full local test suite
+make test
+
+# Build the installer script
+make build/install.sh
+
+# Build an add-on package
+make dist/<addon>-<version>.tar.gz
+
+# Generate add-on metadata
+make generate-addons
+
+# Watch and sync builds to a remote Linux test VM
+export GOOS=linux GOARCH=amd64 DOCKER_DEFAULT_PLATFORM=linux/amd64 REMOTES="USER@IP"
+make watchrsync
+
+# Run shell tests in Docker
+make docker-test-shell
+```
+
+## 9. Key files for agents to know
+
+- `Makefile` — primary build orchestration.
+- `scripts/install.sh` — entrypoint source with `Magic begin/end` markers.
+- `scripts/Manifest` and `hack/testdata/manifest/clean` — must stay in sync.
+- `scripts/common/addon.sh` — add-on runtime orchestration.
+- `bin/save-manifest-assets.sh` — downloads/builds all assets described by a `Manifest`.
+- `web/src/installers/versions.js` — human-edited version registry.
+- `addons-gen.json` and `supported-versions-gen.json` — generated API metadata.
+- `.github/workflows/deploy-staging.yaml` and `.github/workflows/deploy-prod.yaml` — release pipelines.
+- `.github/workflows/test-addon-pr.yaml` — PR add-on testing.
+- `bin/test-addon.sh` — submits add-on Testgrid runs.
+- `pkg/cli/commands.go` — `kurl` CLI command tree.

From 535790488c4feafedcce80534842814a7441c973 Mon Sep 17 00:00:00 2001
From: Xav Paice <xav@replicated.com>
Date: Mon, 22 Jun 2026 11:29:26 +1200
Subject: [PATCH 2/5] fix .gitignore: remove .opencode so skill files are
 tracked normally

---
 .gitignore | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index c7ddb7b545..ac71956031 100644
--- a/.gitignore
+++ b/.gitignore
@@ -31,7 +31,6 @@ sbom/
 
 # ai/editor metadata
 .claude
-.opencode
 .cursor
 .aider*
 .continue

From 119bfaa39808bcae2b6a4cb7e0c75a3e5a977c0b Mon Sep 17 00:00:00 2001
From: Xav Paice <xav@replicated.com>
Date: Mon, 22 Jun 2026 11:51:29 +1200
Subject: [PATCH 3/5] agents fixes: use repo URLs in AGENTS.md, guard .envrc

---
 .envrc    | 2 +-
 AGENTS.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.envrc b/.envrc
index 57ccbaff3c..0b3d32981d 100644
--- a/.envrc
+++ b/.envrc
@@ -1 +1 @@
-source .env
\ No newline at end of file
+[[ -f .env ]] && source_env .env
diff --git a/AGENTS.md b/AGENTS.md
index 87d1c15281..6d2e3c82cc 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -10,8 +10,8 @@ A user posts a YAML `Installer` manifest to the kURL.sh API and receives a deter
 
 The kURL project has two main repositories:
 
-- **kURL** (`/Users/xav/go/src/github.com/replicatedhq/kURL`) — builds the installer scripts, add-on packages, host packages, Go utilities, and the public API registry.
-- **kURL-testgrid** (`/Users/xav/go/src/github.com/replicatedhq/kURL-testgrid`) — the test automation platform that provisions real Linux VMs, runs kURL installers, executes Sonobuoy conformance tests, and publishes results at `https://testgrid.kurl.sh`.
+- **kURL** (`github.com/replicatedhq/kURL`) — builds the installer scripts, add-on packages, host packages, Go utilities, and the public API registry.
+- **kURL-testgrid** (`github.com/replicatedhq/kURL-testgrid`) — the test automation platform that provisions real Linux VMs, runs kURL installers, executes Sonobuoy conformance tests, and publishes results at `https://testgrid.kurl.sh`.
 
 ## 2. Repository structure
 

From e350717c2cc3e23e70902dcc4e638414aa642cf7 Mon Sep 17 00:00:00 2001
From: Xav Paice <xav@replicated.com>
Date: Mon, 22 Jun 2026 11:52:33 +1200
Subject: [PATCH 4/5] catch URLError alongside HTTPError in testgrid fetch
 script

---
 .../skills/testgrid-failure-analysis/fetch.py      | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)
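Reviewer note (a sketch, not part of the patch): the three patched handlers share one rule, which is to stay quiet on HTTP 404 and warn on everything else. The `should_warn` helper and the `example.invalid` URLs below are hypothetical and only illustrate that rule.

```python
from email.message import Message
from urllib.error import HTTPError, URLError


def should_warn(exc):
    """Return True unless exc is an HTTP 404, mirroring the patched handlers."""
    return not isinstance(exc, HTTPError) or exc.code != 404


# Hypothetical errors built by hand; no network access is needed.
not_found = HTTPError("https://example.invalid/logs", 404, "Not Found", Message(), None)
server_error = HTTPError("https://example.invalid/logs", 500, "Server Error", Message(), None)
unreachable = URLError("connection refused")

if __name__ == "__main__":
    for exc in (not_found, server_error, unreachable):
        print(type(exc).__name__, "warn" if should_warn(exc) else "silent")
```

`HTTPError` is a subclass of `URLError`, so `except URLError` alone would catch both; the tuple in the patch is harmless and arguably more explicit.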

diff --git a/.opencode/skills/testgrid-failure-analysis/fetch.py b/.opencode/skills/testgrid-failure-analysis/fetch.py
index 6af39bddb4..5465ab71bd 100755
--- a/.opencode/skills/testgrid-failure-analysis/fetch.py
+++ b/.opencode/skills/testgrid-failure-analysis/fetch.py
@@ -21,7 +21,7 @@
 import subprocess
 import sys
 from urllib.request import Request, urlopen
-from urllib.error import HTTPError
+from urllib.error import HTTPError, URLError
 
 
 def api_request(url, api_key=None, data=None, method=None, timeout=60):
@@ -135,8 +135,8 @@ def main():
                 with open(os.path.join(inst_dir, "logs.txt"), "w") as f:
                     f.write(logs)
                 print("  saved instance logs")
-        except HTTPError as e:
-            if e.code != 404:
+        except (HTTPError, URLError) as e:
+            if not isinstance(e, HTTPError) or e.code != 404:
                 print(f"  warning: could not fetch instance logs: {e}", file=sys.stderr)
 
         # Sonobuoy results, if present.
@@ -147,8 +147,8 @@ def main():
                 with open(os.path.join(inst_dir, "sonobuoy.txt"), "w") as f:
                     f.write(results)
                 print("  saved sonobuoy results")
-        except HTTPError as e:
-            if e.code != 404:
+        except (HTTPError, URLError) as e:
+            if not isinstance(e, HTTPError) or e.code != 404:
                 print(f"  warning: could not fetch sonobuoy results: {e}", file=sys.stderr)
 
         # Per-node logs and support bundles.
@@ -191,8 +191,8 @@ def main():
                         except Exception as e:
                             print(f"  failed to decrypt {enc_path}: {e}", file=sys.stderr)
 
-            except HTTPError as e:
-                if e.code != 404:
+            except (HTTPError, URLError) as e:
+                if not isinstance(e, HTTPError) or e.code != 404:
                     print(f"  warning: could not fetch {node_id} node logs: {e}", file=sys.stderr)
 
     print(f"\nArtifacts written to: {out_dir}")

From 0cf260977ccf4ddf7979a2f74edfbb8cd667e921 Mon Sep 17 00:00:00 2001
From: Xav Paice <xav@replicated.com>
Date: Tue, 23 Jun 2026 12:29:31 +1200
Subject: [PATCH 5/5] add docs for .env

---
 AGENTS.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
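Reviewer note (a sketch, not part of the patch): the variables documented here are consumed by `fetch.py` through its argparse defaults. The `resolve_api_token` helper below is hypothetical and restates that lookup order under the assumption that an explicit flag always wins.

```python
import os


def resolve_api_token(cli_value=None, environ=os.environ):
    """Pick the API token: explicit flag, then TESTGRID_API_TOKEN, then TESTGRID_API_KEY."""
    if cli_value is not None:
        return cli_value
    # `or` means an empty TESTGRID_API_TOKEN also falls through to the legacy name.
    return environ.get("TESTGRID_API_TOKEN") or environ.get("TESTGRID_API_KEY")


if __name__ == "__main__":
    print("token configured:", resolve_api_token() is not None)
```

The fallback to `TESTGRID_API_KEY` keeps older `.env` files working without edits.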

diff --git a/AGENTS.md b/AGENTS.md
index 6d2e3c82cc..f3597995c4 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -179,6 +179,19 @@ tgrun queue --spec <spec> --os-spec <os-spec> --ref <ref> --api-token <token>
 
 Add-ons can define their own specs under `addons/<name>/template/testgrid/*.yaml`. During add-on tests, `bin/test-addon.sh` substitutes `__testver__` and `__testdist__` placeholders in the spec templates.
 
+#### Testgrid credentials for local runs
+
+A few local helpers (for example, `bin/test-addon.sh` and the `testgrid-failure-analysis` skill) read Testgrid secrets from environment variables. Put these values in a `.env` file at the repository root (not `.envrc`) and keep the file out of version control. `.env` is already listed in `.gitignore`.
+
+If you use `direnv`, the `.envrc` in the repo root will source `.env` when you enter the directory, once you have approved it with `direnv allow`. Example variables to set in `.env` (replace `...` with your own secrets):
+
+```bash
+export TESTGRID_API_TOKEN=...
+export TESTGRID_AGE_PASSPHRASE=...
+```
+
+Never commit real values or `.env` files to the repository. The tracked `.envrc` only loads `.env` and must not contain secrets itself.
+
 ### How Testgrid works (high-level)
 
 1. A user or CI invokes `tgrun queue` with a test spec and OS spec.
@@ -242,6 +255,7 @@ make kurl-util-image       # replicated/kurl-util:alpha
 - **Testgrid OS filtering is opt-out** via `unsupportedOSIDs`; there is no `supportedOSIDs`.
 - **Linux/amd64 required** for runtime testing. The scripts are not macOS-compatible; use a remote Linux VM or Docker.
 - **On Apple Silicon**, set `GOOS=linux`, `GOARCH=amd64`, and `DOCKER_DEFAULT_PLATFORM=linux/amd64` before building.
+- **Keep secrets in `.env`**. `.env` is gitignored; `.envrc` is tracked and only loads `.env`, so it must not contain secrets. If you use `direnv`, `.envrc` will automatically load `.env`; otherwise, source it manually. Never commit secrets or `.env`.
 
 ## 8. Useful commands