From aea42ac4c2cbf5173a390deaf85b25a32438ac2d Mon Sep 17 00:00:00 2001
From: Xav Paice
Date: Mon, 22 Jun 2026 09:46:46 +1200
Subject: [PATCH 1/5] add opencode skill for testgrid

---
 .envrc                                        |   1 +
 .gitignore                                    |   4 +-
 .opencode/.gitignore                          |   5 +
 .../skills/testgrid-failure-analysis/SKILL.md |  78 +++++
 .../skills/testgrid-failure-analysis/fetch.py | 202 +++++++++++++
 AGENTS.md                                     | 281 ++++++++++++++++++
 6 files changed, 569 insertions(+), 2 deletions(-)
 create mode 100644 .envrc
 create mode 100644 .opencode/.gitignore
 create mode 100644 .opencode/skills/testgrid-failure-analysis/SKILL.md
 create mode 100755 .opencode/skills/testgrid-failure-analysis/fetch.py
 create mode 100644 AGENTS.md

diff --git a/.envrc b/.envrc
new file mode 100644
index 0000000000..57ccbaff3c
--- /dev/null
+++ b/.envrc
@@ -0,0 +1 @@
+source .env
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
index 02438d2b42..c7ddb7b545 100644
--- a/.gitignore
+++ b/.gitignore
@@ -36,11 +36,11 @@ sbom/
 .aider*
 .continue
 .codeium
-CLAUDE.md
-AGENTS.md
 .repomixignore
 
 # local workflow
 plans/
 docs/
 release-manifest.json
+
+.env
diff --git a/.opencode/.gitignore b/.opencode/.gitignore
new file mode 100644
index 0000000000..2ed394f164
--- /dev/null
+++ b/.opencode/.gitignore
@@ -0,0 +1,5 @@
+node_modules
+package.json
+package-lock.json
+bun.lock
+.gitignore
\ No newline at end of file
diff --git a/.opencode/skills/testgrid-failure-analysis/SKILL.md b/.opencode/skills/testgrid-failure-analysis/SKILL.md
new file mode 100644
index 0000000000..9746239c98
--- /dev/null
+++ b/.opencode/skills/testgrid-failure-analysis/SKILL.md
@@ -0,0 +1,78 @@
+---
+name: testgrid-failure-analysis
+description: Use when analyzing a failed Testgrid kURL run to fetch the run results, failure logs, and encrypted support bundles from the Testgrid API and write them into a directory for offline analysis; trigger with "testgrid failure analysis", "fetch testgrid logs", "get support bundle from Testgrid", or "analyze Testgrid run".
+---
+
+# Testgrid failure analysis
+
+This skill helps an agent collect the artifacts of a failed [Testgrid](https://testgrid.kurl.sh/) run so they can be analyzed locally.
+
+## What it does
+
+1. Queries the Testgrid API for a run by `refId`.
+2. Identifies every failed instance (`isSuccess == false`, not unsupported, not skipped, and finished).
+3. For each failure, fetches:
+   - The instance metadata (`instance.json`)
+   - The main instance logs (populated when the VM fails to start)
+   - Sonobuoy results, if any
+   - The per-node logs from the actual test VMs (`{nodeId}.log.txt`)
+   - Any encrypted support bundles whose S3 URLs are printed in the node logs
+4. Writes everything into a structured output directory ready for an agent to inspect.
+
+## Important details from the codebase
+
+- Public API base path is `/api/v1`. The endpoints used are:
+  - `POST /api/v1/run/{refId}` — returns the run with its `instances` array, plus `success_count` and `failure_count`.
+  - `GET /api/v1/instance/{instanceId}/logs` — returns `{"logs": "..."}` from the `testinstance.output` column.
+  - `GET /api/v1/instance/{nodeId}/node-logs` — returns `{"logs": "..."}` from the `clusternode.output` column.
+  - `GET /api/v1/instance/{instanceId}/sonobuoy` — returns `{"results": "..."}`.
+- The open-source `/api/v1` endpoints are **not** authenticated by default (the `api-token` auth middleware only protects the runner endpoints under `/v1`). However, an optional `--api-token` is accepted and sent as HTTP Basic Auth with username `token` and the provided password, for deployments that add authentication. `--api-key` is kept as a deprecated alias for backward compatibility.
+- Support bundles are collected by the test script (`tgrun/pkg/runner/vmi/embed/runcmd.sh` → `collect_support_bundle`) and uploaded to S3 with the handler at `POST /v1/instance/{instanceId}/bundle`. The S3 URL is printed in the node log output, which is why this skill scans the logs for it.
+- The bundle is encrypted with the `age` file format using a scrypt passphrase. The API stores it with key pattern `{instanceId}-{unix}/bundle.tgz.age`. The downloaded file keeps the `.age` extension.
+- If you provide the age passphrase, the helper script will try to decrypt each bundle in place with `age -d -p`.
+
+## Node IDs used by the runner
+
+Testgrid creates one initial-primary node plus optional additional nodes. The node IDs are predictable from the instance ID and the `numPrimaryNodes` / `numSecondaryNodes` fields, so the skill tries:
+
+- `{instanceId}-initialprimary`
+- `{instanceId}-primary-1` ... `{instanceId}-primary-{numPrimaryNodes-1}`
+- `{instanceId}-secondary-0` ... `{instanceId}-secondary-{numSecondaryNodes-1}`
+
+Only nodes that actually produced logs will be saved.
+
+## How to use
+
+Run the helper script shipped with this skill:
+
+```bash
+python3 .opencode/skills/testgrid-failure-analysis/fetch.py \
+  --api-endpoint https://api.testgrid.kurl.sh \
+  --ref-id \
+  --output-dir ./testgrid-analysis/ \
+  [--api-token ] \
+  [--age-passphrase ]
+```
+
+Environment variables are also supported:
+
+- `TESTGRID_API_TOKEN` → `--api-token` (`TESTGRID_API_KEY` is still read as a fallback)
+- `TESTGRID_AGE_PASSPHRASE` → `--age-passphrase`
+
+## Output layout
+
+```
+/
+  run.json                     # full run response
+  /
+    instance.json              # instance metadata
+    logs.txt                   # main instance output, if any
+    sonobuoy.txt               # sonobuoy results, if any
+    -initialprimary.log.txt
+    bundle--0.tgz.age          # encrypted support bundle
+    bundle--0.tgz              # decrypted support bundle (if passphrase supplied)
+```
+
+## What to do next
+
+After fetching, read the `run.json` summary, open the per-instance logs, and inspect any decrypted support bundles. If a bundle could not be downloaded, grep the corresponding node log for `bundle.tgz.age` to find the raw S3 URL.
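Editor's note on the last step of SKILL.md: the manual grep can also be scripted. The sketch below reuses the bundle-URL pattern that `fetch.py` compiles and lists the candidate URLs in a saved node log. The helper name `find_bundle_urls` and the sample log text are illustrative assumptions, not part of the skill.

```python
import re

# Same pattern fetch.py compiles for support-bundle links in node logs.
BUNDLE_RE = re.compile(
    r"https?://[^\s\"]+?\.s3\.amazonaws\.com/[^\s\"]+?bundle\.tgz\.age")


def find_bundle_urls(log_text):
    """Return the unique bundle URLs found in a node log, sorted."""
    return sorted(set(BUNDLE_RE.findall(log_text)))


if __name__ == "__main__":
    # Hypothetical log excerpt; real input would be a saved *.log.txt file.
    sample = (
        "collecting support bundle\n"
        "uploaded https://example-bucket.s3.amazonaws.com/abc-123/bundle.tgz.age\n"
        "uploaded https://example-bucket.s3.amazonaws.com/abc-123/bundle.tgz.age\n"
    )
    for url in find_bundle_urls(sample):
        print(url)
```

Deduplicating before sorting mirrors what `fetch.py` does, so a URL that is logged twice is only reported once.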
diff --git a/.opencode/skills/testgrid-failure-analysis/fetch.py b/.opencode/skills/testgrid-failure-analysis/fetch.py
new file mode 100755
index 0000000000..6af39bddb4
--- /dev/null
+++ b/.opencode/skills/testgrid-failure-analysis/fetch.py
@@ -0,0 +1,202 @@
+#!/usr/bin/env python3
+"""Fetch Testgrid failure logs and support bundles for analysis.
+
+Queries the public Testgrid API endpoints:
+    GET /api/v1/runs
+    POST /api/v1/run/{refId}
+    GET /api/v1/instance/{id}/logs
+    GET /api/v1/instance/{nodeId}/node-logs
+    GET /api/v1/instance/{id}/sonobuoy
+
+For every failed instance in a run, it downloads the instance logs and the per-node
+logs, scans them for encrypted support-bundle URLs, downloads the bundles, and
+optionally decrypts them with age when an age passphrase is supplied.
+"""
+
+import argparse
+import base64
+import json
+import os
+import re
+import subprocess
+import sys
+from urllib.request import Request, urlopen
+from urllib.error import HTTPError
+
+
+def api_request(url, api_key=None, data=None, method=None, timeout=60):
+    headers = {"Accept": "application/json"}
+    if api_key:
+        creds = base64.b64encode(b"token:" + api_key.encode()).decode()
+        headers["Authorization"] = f"Basic {creds}"
+    if data is not None:
+        method = method or "POST"
+        headers["Content-Type"] = "application/json"
+    req = Request(url, data=data, headers=headers, method=method)
+    with urlopen(req, timeout=timeout) as resp:
+        return resp.read().decode("utf-8")
+
+
+def download(url, path, timeout=300):
+    req = Request(url, headers={"User-Agent": "testgrid-failure-analysis"})
+    with urlopen(req, timeout=timeout) as resp:
+        with open(path, "wb") as f:
+            f.write(resp.read())
+
+
+def node_ids(instance_id, num_primary, num_secondary):
+    """Return the node IDs the runner creates for a given instance."""
+    ids = [f"{instance_id}-initialprimary"]
+    for i in range(1, max(1, num_primary)):
+        ids.append(f"{instance_id}-primary-{i}")
+    for i in range(num_secondary):
+        ids.append(f"{instance_id}-secondary-{i}")
+    return ids
+
+
+def save_json(path, obj):
+    with open(path, "w") as f:
+        json.dump(obj, f, indent=2)
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Fetch Testgrid run failures, logs, and support bundles.")
+    parser.add_argument(
+        "--api-endpoint", required=True,
+        help="Testgrid API base URL, e.g. https://api.testgrid.kurl.sh")
+    parser.add_argument(
+        "--ref-id", required=True,
+        help="Testgrid run refId (the run identifier shown in the UI)")
+    parser.add_argument(
+        "--output-dir", required=True,
+        help="Directory where artifacts will be written")
+    parser.add_argument(
+        "--api-token", "--api-key", dest="api_token",
+        default=os.environ.get("TESTGRID_API_TOKEN") or os.environ.get("TESTGRID_API_KEY"),
+        help="Optional API token. If the server requires it, sent as basic-auth password with username 'token'. Reads TESTGRID_API_TOKEN (preferred) or TESTGRID_API_KEY (legacy) if not provided.")
+    parser.add_argument(
+        "--age-passphrase", default=os.environ.get("TESTGRID_AGE_PASSPHRASE"),
+        help="Optional age passphrase to decrypt downloaded .age support bundles")
+    parser.add_argument(
+        "--age-bin", default="age",
+        help="Path to the age decryption binary")
+    parser.add_argument(
+        "--page-size", type=int, default=1000,
+        help="Page size for the run query")
+    args = parser.parse_args()
+
+    base = args.api_endpoint.rstrip("/")
+    if not base.endswith("/api/v1"):
+        base = base + "/api/v1"
+
+    out_dir = args.output_dir
+    os.makedirs(out_dir, exist_ok=True)
+
+    run_url = f"{base}/run/{args.ref_id}"
+    print(f"Fetching run {args.ref_id} ...")
+    run_data = json.loads(api_request(
+        run_url, args.api_token,
+        data=json.dumps({"pageSize": args.page_size}).encode(),
+        timeout=120))
+
+    save_json(os.path.join(out_dir, "run.json"), run_data)
+
+    instances = run_data.get("instances", [])
+    failures = [
+        i for i in instances
+        if not i.get("isSuccess")
+        and not i.get("isUnsupported")
+        and not i.get("isSkipped")
+        and i.get("finishedAt")
+    ]
+
+    print(f"Found {len(failures)} failed instance(s) out of {len(instances)} total.")
+    if not failures:
+        return
+
+    bundle_re = re.compile(
+        r"https?://[^\s\"]+?\.s3\.amazonaws\.com/[^\s\"]+?bundle\.tgz\.age")
+
+    for inst in failures:
+        iid = inst["id"]
+        inst_dir = os.path.join(out_dir, iid)
+        os.makedirs(inst_dir, exist_ok=True)
+        save_json(os.path.join(inst_dir, "instance.json"), inst)
+
+        os_info = f"{inst.get('osName', '?')} {inst.get('osVersion', '?')}"
+        print(f"\n{iid} ({os_info}, reason: {inst.get('failureReason', 'unknown')})")
+
+        # Main instance output (usually only populated on VMI startup failures).
+        try:
+            txt = api_request(f"{base}/instance/{iid}/logs", args.api_token, timeout=60)
+            logs = json.loads(txt).get("logs", "")
+            if logs:
+                with open(os.path.join(inst_dir, "logs.txt"), "w") as f:
+                    f.write(logs)
+                print(" saved instance logs")
+        except HTTPError as e:
+            if e.code != 404:
+                print(f" warning: could not fetch instance logs: {e}", file=sys.stderr)
+
+        # Sonobuoy results, if present.
+        try:
+            txt = api_request(f"{base}/instance/{iid}/sonobuoy", args.api_token, timeout=60)
+            results = json.loads(txt).get("results", "")
+            if results:
+                with open(os.path.join(inst_dir, "sonobuoy.txt"), "w") as f:
+                    f.write(results)
+                print(" saved sonobuoy results")
+        except HTTPError as e:
+            if e.code != 404:
+                print(f" warning: could not fetch sonobuoy results: {e}", file=sys.stderr)
+
+        # Per-node logs and support bundles.
+        num_primary = inst.get("numPrimaryNodes", 1)
+        num_secondary = inst.get("numSecondaryNodes", 0)
+        for node_id in node_ids(iid, num_primary, num_secondary):
+            try:
+                txt = api_request(f"{base}/instance/{node_id}/node-logs", args.api_token, timeout=60)
+                logs = json.loads(txt).get("logs", "")
+                if not logs:
+                    continue
+
+                log_path = os.path.join(inst_dir, f"{node_id}.log.txt")
+                with open(log_path, "w") as f:
+                    f.write(logs)
+                print(f" saved {node_id} node logs")
+
+                urls = sorted(set(bundle_re.findall(logs)))
+                for j, url in enumerate(urls):
+                    enc_path = os.path.join(inst_dir, f"bundle-{node_id}-{j}.tgz.age")
+                    try:
+                        download(url, enc_path)
+                        print(f" downloaded support bundle -> {enc_path}")
+                    except Exception as e:
+                        print(f" failed to download bundle {url}: {e}", file=sys.stderr)
+                        continue
+
+                    if args.age_passphrase:
+                        dec_path = os.path.join(inst_dir, f"bundle-{node_id}-{j}.tgz")
+                        try:
+                            with open(dec_path, "wb") as dec_file:
+                                subprocess.run(
+                                    [args.age_bin, "-d", "-p", enc_path],
+                                    input=args.age_passphrase.encode(),
+                                    stdout=dec_file,
+                                    check=True,
+                                    stderr=subprocess.PIPE,
+                                )
+                            print(f" decrypted support bundle -> {dec_path}")
+                        except Exception as e:
+                            print(f" failed to decrypt {enc_path}: {e}", file=sys.stderr)
+
+            except HTTPError as e:
+                if e.code != 404:
+                    print(f" warning: could not fetch {node_id} node logs: {e}", file=sys.stderr)
+
+    print(f"\nArtifacts written to: {out_dir}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..87d1c15281
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,281 @@
+# Agent Guide for kURL
+
+This guide is for AI agents working in the kURL repository. It explains how the project is organized, how changes are made (especially add-on updates), how to test, and how the repository relates to the kURL-testgrid testing platform.
+
+## 1. Project overview
+
+**kURL** is a Kubernetes installer for air-gapped and online clusters, maintained by Replicated. It automates the tasks a cluster administrator must perform before and after running `kubeadm init` to create a production-ready Kubernetes cluster.
+
+A user posts a YAML `Installer` manifest to the kURL.sh API and receives a deterministic hash. That hash can be used to fetch an install script (`https://kurl.sh/`) or an air-gap bundle (`https://kurl.sh/bundle/.tar.gz`). The installer then downloads pre-built add-on tarballs and host packages from the `kurl-sh` S3 bucket and executes the install/upgrade on the target node.
+
+The kURL project has two main repositories:
+
+- **kURL** (`/Users/xav/go/src/github.com/replicatedhq/kURL`) — builds the installer scripts, add-on packages, host packages, Go utilities, and the public API registry.
+- **kURL-testgrid** (`/Users/xav/go/src/github.com/replicatedhq/kURL-testgrid`) — the test automation platform that provisions real Linux VMs, runs kURL installers, executes Sonobuoy conformance tests, and publishes results at `https://testgrid.kurl.sh`.
+
+## 2. Repository structure
+
+Key directories in this repo:
+
+| Directory | Purpose |
+|---|---|
+| `addons///` | Add-on definitions, one directory per version. Each version contains at minimum `Manifest`, `install.sh`, and usually `host-preflight.yaml`. |
+| `bin/` | Build, packaging, and release helper scripts. |
+| `bundles/` | Docker build contexts for host packages (Kubernetes RPM/DEB repos per OS). |
+| `cmd/` | Go entrypoints; `cmd/kurl` is the main CLI binary. |
+| `hack/` | Local development helpers, test data, and test Dockerfiles. |
+| `kurl_util/` | Go utilities built into binaries and the `replicated/kurl-util` Docker image. |
+| `packages/` | Host-package definitions (e.g., `kubernetes`, `host/openssl`, `host/fio`). |
+| `pkg/` | Go library code for the `kurl` CLI. |
+| `scripts/` | The bash installer source (`install.sh`, `join.sh`, `upgrade.sh`, `tasks.sh`, `common/`). |
+| `testgrid/specs/` | Testgrid YAML specs for OS images and test scenarios. |
+| `tools/` | Additional tooling. |
+| `web/src/installers/` | Frontend version registry; `versions.js` lists available add-on versions. |
+| `.github/workflows/` | CI/CD workflows. |
+
+Important documents:
+
+- `README.md` — project overview and community links.
+- `ARCHITECTURE.md` — manifest format, API services, object storage, add-on lifecycle, release workflows.
+- `CONTRIBUTING.md` — development workflow, remote testing, environment setup.
+- `addons/README.md` — add-on structure, Manifest directives, lifecycle hooks.
+- `testgrid/specs/README.md` — how Testgrid specs work.
+- `docs/arch/adr-003-external-addons.md` — external add-on model (`kotsadm`).
+- `CODEOWNERS` — global owner `@replicatedhq/embedded-kubernetes`.
+
+## 3. How to approach changes
+
+### General workflow
+
+1. **Identify the scope.** Is this a core installer change, an add-on change, a Testgrid-only change, or a version bump? Core scripts live in `scripts/`. Add-ons live in `addons///`. Testgrid specs live in `testgrid/specs/` and in add-on `template/testgrid/` directories.
+2. **Make the minimal change.** Edit only the source files. Generated files (e.g., `build/`, `dist/`, `addons-gen.json`, `supported-versions-gen.json`) are produced by `make` targets and should not be hand-edited.
+3. **Keep `scripts/Manifest` in sync with `hack/testdata/manifest/clean`.** `make test` compares these two files and fails if they differ. If you modify `scripts/Manifest`, update the test data copy as well.
+4. **Run local tests.** See the Testing section below.
+5. **Let CI run Testgrid for add-on changes.** The `test-addon-pr.yaml` workflow detects modified add-ons and queues Testgrid runs automatically.
+
+### When working on a remote Linux VM
+
+The install scripts are not macOS-compatible. For real-world testing, use `make watchrsync` after exporting:
+
+```bash
+export GOOS=linux
+export GOARCH=amd64
+export DOCKER_DEFAULT_PLATFORM=linux/amd64
+export REMOTES="USER@TARGET_SERVER_IP"
+make watchrsync
+```
+
+`bin/watchrsync.js` continuously syncs local builds to `~/kurl` on the remote server.
+
+### macOS local-build prerequisites
+
+Install `gnu-sed` and `md5sha1sum` (e.g., via Homebrew) to run local build scripts. Apple Silicon hosts should also set `GOOS=linux`, `GOARCH=amd64`, and `DOCKER_DEFAULT_PLATFORM=linux/amd64` before building.
+
+## 4. How to update add-ons
+
+Add-ons are the primary extension point of kURL. Each add-on is a versioned directory with a declarative `Manifest`, an `install.sh` shell script, and optional templates.
+
+### Add-on structure
+
+```text
+addons///
+  Manifest              # assets, images, host packages to download
+  install.sh            # defines a function named exactly like the add-on
+  host-preflight.yaml   # optional Troubleshoot.sh preflight spec
+  assets/               # generated at build time
+  images/               # generated at build time
+  ...
+```
+
+The `install.sh` must define a function with the exact add-on name (e.g., `function containerd()`). Lifecycle hooks are defined in `addons/README.md`:
+
+- `addon_fetch`
+- `addon_load`
+- `addon_preflights`
+- `addon_pre_init`
+- `addon_install` (required)
+- `addon_already_applied`
+- `addon_join`
+- `addon_post_init`
+- `addon_outro`
+
+### Manifest directives
+
+Common directives (see `addons/README.md` for the full list):
+
+```text
+image pause k8s.gcr.io/pause:3.6
+yum libzstd
+yum8 container-selinux
+yumol
+apt containerd
+apt24 containerd
+yum2023 containerd
+asset runc https://github.com/opencontainers/runc/releases/download/v1.3.5/runc.amd64
+dockerout rhel-9 addons/containerd/template/Dockerfile.rhel9 1.7.29
+```
+
+### Adding a new add-on version
+
+1. Create or generate the version directory under `addons///`.
+2. Ensure it contains `Manifest`, `install.sh`, and `host-preflight.yaml`.
+3. Add or update Testgrid specs under `addons//template/testgrid/*.yaml` if needed.
+4. Register the new version in `web/src/installers/versions.js`.
+5. Run `make generate-addons` to update `addons-gen.json` and `supported-versions-gen.json`.
+6. Build the package locally: `make dist/-.tar.gz`.
+7. Test with Testgrid or via `make watchrsync` on a remote Linux VM.
+
+### Templated add-ons
+
+Some add-ons are auto-generated from a `template/` directory:
+
+- `addons/containerd/template/` — `script.sh`, Dockerfiles, and testgrid specs.
+- `addons/flannel/template/` — `generate.sh`, `base/`, and testgrid specs.
+
+For templated add-ons, changes should be made in the template, then the version is regenerated. Do not hand-edit generated version directories.
+
+### External add-ons
+
+`kotsadm` is an external add-on (see `docs/arch/adr-003-external-addons.md`). It is built and released from `replicatedhq/kots`, publishes a `versions.json`, and the `import-external-addons` action copies packages into the kURL S3 bucket. The kURL API merges these with internal versions.
+
+## 5. Testing
+
+### Local tests
+
+```bash
+# Lint, vet, Go tests, and manifest check
+make test
+
+# Go tests only
+go test ./cmd/... ./pkg/...
+make -C kurl_util test
+
+# Shell tests in Docker
+make docker-test-shell
+
+# Containerd configure/upgrade regression tests
+make docker-test-containerd
+
+# Build a specific add-on package
+make dist/containerd-1.7.29.tar.gz
+
+# Build Kubernetes host packages for a specific OS
+make build/packages/kubernetes/1.31.14/ubuntu-22.04
+make build/packages/kubernetes/1.31.14/images
+```
+
+### Testgrid integration
+
+Testgrid is the primary end-to-end testing platform. It is a separate repository (`kURL-testgrid`), but the specs that drive it live in this repo:
+
+- `testgrid/specs/os-*.yaml` — OS image pools.
+- `testgrid/specs/deploy.yaml`, `full.yaml`, `latest.yaml`, `storage-migration.yaml`, `customer-migration-specs.yaml`, `k8s-upgrade.yaml` — test scenarios.
+
+Active CI usage is documented in `testgrid/specs/README.md`. Workflows submit runs via the `replicated/tgrun` Docker image:
+
+```bash
+tgrun queue --spec --os-spec --ref --api-token
+```
+
+Add-ons can define their own specs under `addons//template/testgrid/*.yaml`. During add-on tests, `bin/test-addon.sh` substitutes `__testver__` and `__testdist__` placeholders in the spec templates.
+
+### How Testgrid works (high-level)
+
+1. A user or CI invokes `tgrun queue` with a test spec and OS spec.
+2. `tgrun` submits each installer spec to the kURL API to get a runnable URL/hash.
+3. `tgrun` enqueues planned VM instances to the TGAPI (Testgrid API).
+4. `tgrun run` (the runner daemon) polls TGAPI, creates KubeVirt VMs on bare-metal hosts, and runs the test scripts.
+5. VMs report status, logs, Sonobuoy results, and support bundles back to TGAPI.
+6. The web UI reads TGAPI and displays results at `https://testgrid.kurl.sh`.
+
+Key Testgrid API endpoints for public data:
+
+- `POST /api/v1/run/{refId}` — get a run with its instances.
+- `GET /api/v1/instance/{instanceId}/logs` — main instance logs.
+- `GET /api/v1/instance/{nodeId}/node-logs` — per-node logs.
+- `GET /api/v1/instance/{instanceId}/sonobuoy` — Sonobuoy results.
+
+## 6. Build and release
+
+### Primary build targets
+
+```bash
+make build/install.sh   # single-file installer script
+make build/join.sh      # single-file join script
+make build/upgrade.sh   # single-file upgrade script
+make build/tasks.sh     # single-file tasks script
+make dist/-.tar.gz
+make dist/common.tar.gz
+make dist/kurl-bin-utils-.tar.gz
+make build/bin/kurl     # kurl CLI
+make kurl-util-image    # replicated/kurl-util:alpha
+```
+
+### Script assembly
+
+`build/install.sh` is assembled from `scripts/install.sh` by inlining every file referenced by `. $DIR/scripts/...` between the `# Magic begin` and `# Magic end` markers. The same pattern applies to `join.sh`, `upgrade.sh`, and `tasks.sh`.
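Editor's note on the Script assembly paragraph: the inlining rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The real build is driven by the `make` targets, and the include-line pattern, the `assemble` helper, and the way marker lines are kept in the output are guesses made for the example, not the Makefile's actual logic.

```python
import os
import re
import tempfile

# Assumed shape of an include line: ". $DIR/scripts/<relative path>".
INCLUDE_RE = re.compile(r"^\s*\.\s+\$DIR/scripts/(\S+)\s*$")


def assemble(entry_path, scripts_dir):
    """Inline sourced files that appear between the Magic markers."""
    out, in_magic = [], False
    with open(entry_path) as f:
        for line in f:
            stripped = line.strip()
            if stripped.startswith("# Magic begin"):
                in_magic = True
            elif stripped.startswith("# Magic end"):
                in_magic = False
            # Include lines outside the markers are left untouched.
            match = INCLUDE_RE.match(line) if in_magic else None
            if match:
                with open(os.path.join(scripts_dir, match.group(1))) as inc:
                    body = inc.read()
                out.append(body if body.endswith("\n") else body + "\n")
            else:
                out.append(line)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical two-file layout built in a temp directory for the demo.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "common"))
    with open(os.path.join(root, "common", "hello.sh"), "w") as f:
        f.write("echo hello\n")
    with open(os.path.join(root, "install.sh"), "w") as f:
        f.write("# Magic begin\n. $DIR/scripts/common/hello.sh\n# Magic end\n")
    print(assemble(os.path.join(root, "install.sh"), root), end="")
```

Restricting the substitution to the marked region keeps any other `.` (source) lines in the entrypoint as they are, which is the behavior the paragraph describes.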
+
+### CI/CD workflows
+
+- `.github/workflows/build-test.yaml` — runs on every PR (Go mod tidy, kurl_util tests, kurl build, shell tests, containerd tests).
+- `.github/workflows/deploy-staging.yaml` — runs on every merge to `main`. Builds packages, uploads to `s3://kurl-sh/staging/-/`, generates `addons-gen.json`/`supported-versions-gen.json`, and queues Testgrid.
+- `.github/workflows/deploy-prod.yaml` — triggered by tags `v*.*.*`. Copies staging packages to `s3://kurl-sh/dist//`, creates a GitHub release, generates SBOMs, and queues Testgrid.
+- `.github/workflows/test-addon-pr.yaml` — detects modified add-ons in PRs and invokes `test-addon.yaml` for each.
+- `.github/workflows/test-addon.yaml` — builds a single add-on package, uploads to S3, and queues Testgrid using the add-on's template specs.
+- `.github/workflows/update-.yaml` — scheduled workflows that generate PRs for new upstream add-on versions (e.g., `update-containerd.yaml`, `update-flannel.yaml`).
+
+### Release versioning
+
+- Production tags: `vYYYY.MM.DD-#` (e.g., `v2024.07.02-0`).
+- Staging versions: `-` (e.g., `v2024.07.02-0-5af497c`).
+- `make tag-and-release` creates a production tag and triggers the release workflow.
+- The current release is advertised by `s3://kurl-sh/dist/VERSION` and `s3://kurl-sh/staging/VERSION`.
+
+## 7. Conventions and gotchas
+
+- **Add-on directory names** are lowercase (`containerd`, `flannel`, `rook`).
+- **Version directories** use the upstream version string (`1.7.29`, `2.8.1`).
+- **All shell functions in an add-on** should be prefixed with the add-on name to avoid collisions (e.g., `containerd_configure`).
+- **`versions.js`** is the source of truth for selectable add-on versions. The generated `addons-gen.json` and `supported-versions-gen.json` are produced by `make generate-addons`.
+- **`scripts/Manifest`** must stay in sync with `hack/testdata/manifest/clean`; `make test` enforces this.
+- **Generated version directories** should not be hand-edited for templated add-ons; edit the template and regenerate.
+- **Testgrid OS filtering is opt-out** via `unsupportedOSIDs`; there is no `supportedOSIDs`.
+- **Linux/amd64 required** for runtime testing. The scripts are not macOS-compatible; use a remote Linux VM or Docker.
+- **On Apple Silicon**, set `GOOS=linux`, `GOARCH=amd64`, and `DOCKER_DEFAULT_PLATFORM=linux/amd64` before building.
+
+## 8. Useful commands
+
+```bash
+# Full local test suite
+make test
+
+# Build the installer script
+make build/install.sh
+
+# Build an add-on package
+make dist/-.tar.gz
+
+# Generate add-on metadata
+make generate-addons
+
+# Watch and sync builds to a remote Linux test VM
+export GOOS=linux GOARCH=amd64 DOCKER_DEFAULT_PLATFORM=linux/amd64 REMOTES="USER@IP"
+make watchrsync
+
+# Run shell tests in Docker
+make docker-test-shell
+```
+
+## 9. Key files for agents to know
+
+- `Makefile` — primary build orchestration.
+- `scripts/install.sh` — entrypoint source with `Magic begin/end` markers.
+- `scripts/Manifest` and `hack/testdata/manifest/clean` — must stay in sync.
+- `scripts/common/addon.sh` — add-on runtime orchestration.
+- `bin/save-manifest-assets.sh` — downloads/builds all assets described by a `Manifest`.
+- `web/src/installers/versions.js` — human-edited version registry.
+- `addons-gen.json` and `supported-versions-gen.json` — generated API metadata.
+- `.github/workflows/deploy-staging.yaml` and `.github/workflows/deploy-prod.yaml` — release pipelines.
+- `.github/workflows/test-addon-pr.yaml` — PR add-on testing.
+- `bin/test-addon.sh` — submits add-on Testgrid runs.
+- `pkg/cli/commands.go` — `kurl` CLI command tree.
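Editor's note on the Release versioning section of AGENTS.md: the two version shapes can be told apart mechanically. The sketch below is a hedged illustration. The helper name `classify_version` is hypothetical, and treating the staging suffix as a short lowercase hex commit hash is an assumption drawn from the single example `v2024.07.02-0-5af497c`.

```python
import re

# vYYYY.MM.DD-N, optionally followed by "-<suffix>" for staging builds.
# Assumption: the suffix is a lowercase hex string such as a short commit hash.
TAG_RE = re.compile(r"^v(\d{4})\.(\d{2})\.(\d{2})-(\d+)(?:-([0-9a-f]+))?$")


def classify_version(version):
    """Return 'production', 'staging', or None for an unrecognized string."""
    match = TAG_RE.match(version)
    if not match:
        return None
    return "staging" if match.group(5) else "production"


if __name__ == "__main__":
    for candidate in ("v2024.07.02-0", "v2024.07.02-0-5af497c", "2024.07.02"):
        print(candidate, classify_version(candidate))
```

The pattern checks only the shape of the string; it does not validate that the date components form a real calendar date.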
From 535790488c4feafedcce80534842814a7441c973 Mon Sep 17 00:00:00 2001
From: Xav Paice
Date: Mon, 22 Jun 2026 11:29:26 +1200
Subject: [PATCH 2/5] fix .gitignore: remove .opencode so skill files are tracked normally

---
 .gitignore | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index c7ddb7b545..ac71956031 100644
--- a/.gitignore
+++ b/.gitignore
@@ -31,7 +31,6 @@ sbom/
 
 # ai/editor metadata
 .claude
-.opencode
 .cursor
 .aider*
 .continue

From 119bfaa39808bcae2b6a4cb7e0c75a3e5a977c0b Mon Sep 17 00:00:00 2001
From: Xav Paice
Date: Mon, 22 Jun 2026 11:51:29 +1200
Subject: [PATCH 3/5] agents fixes

---
 .envrc    | 2 +-
 AGENTS.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.envrc b/.envrc
index 57ccbaff3c..0b3d32981d 100644
--- a/.envrc
+++ b/.envrc
@@ -1 +1 @@
-source .env
\ No newline at end of file
+[[ -f .env ]] && source_env .env
diff --git a/AGENTS.md b/AGENTS.md
index 87d1c15281..6d2e3c82cc 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -10,8 +10,8 @@ A user posts a YAML `Installer` manifest to the kURL.sh API and receives a deter
 
 The kURL project has two main repositories:
 
-- **kURL** (`/Users/xav/go/src/github.com/replicatedhq/kURL`) — builds the installer scripts, add-on packages, host packages, Go utilities, and the public API registry.
-- **kURL-testgrid** (`/Users/xav/go/src/github.com/replicatedhq/kURL-testgrid`) — the test automation platform that provisions real Linux VMs, runs kURL installers, executes Sonobuoy conformance tests, and publishes results at `https://testgrid.kurl.sh`.
+- **kURL** (`github.com/replicatedhq/kURL`) — builds the installer scripts, add-on packages, host packages, Go utilities, and the public API registry.
+- **kURL-testgrid** (`github.com/replicatedhq/kURL-testgrid`) — the test automation platform that provisions real Linux VMs, runs kURL installers, executes Sonobuoy conformance tests, and publishes results at `https://testgrid.kurl.sh`.
 
 ## 2. Repository structure

From e350717c2cc3e23e70902dcc4e638414aa642cf7 Mon Sep 17 00:00:00 2001
From: Xav Paice
Date: Mon, 22 Jun 2026 11:52:33 +1200
Subject: [PATCH 4/5] catch URLError alongside HTTPError in testgrid fetch script

---
 .../skills/testgrid-failure-analysis/fetch.py | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/.opencode/skills/testgrid-failure-analysis/fetch.py b/.opencode/skills/testgrid-failure-analysis/fetch.py
index 6af39bddb4..5465ab71bd 100755
--- a/.opencode/skills/testgrid-failure-analysis/fetch.py
+++ b/.opencode/skills/testgrid-failure-analysis/fetch.py
@@ -21,7 +21,7 @@
 import subprocess
 import sys
 from urllib.request import Request, urlopen
-from urllib.error import HTTPError
+from urllib.error import HTTPError, URLError
 
 
 def api_request(url, api_key=None, data=None, method=None, timeout=60):
@@ -135,8 +135,8 @@ def main():
                 with open(os.path.join(inst_dir, "logs.txt"), "w") as f:
                     f.write(logs)
                 print(" saved instance logs")
-        except HTTPError as e:
-            if e.code != 404:
+        except (HTTPError, URLError) as e:
+            if not isinstance(e, HTTPError) or e.code != 404:
                 print(f" warning: could not fetch instance logs: {e}", file=sys.stderr)
 
         # Sonobuoy results, if present.
@@ -147,8 +147,8 @@ def main():
                 with open(os.path.join(inst_dir, "sonobuoy.txt"), "w") as f:
                     f.write(results)
                 print(" saved sonobuoy results")
-        except HTTPError as e:
-            if e.code != 404:
+        except (HTTPError, URLError) as e:
+            if not isinstance(e, HTTPError) or e.code != 404:
                 print(f" warning: could not fetch sonobuoy results: {e}", file=sys.stderr)
 
         # Per-node logs and support bundles.
@@ -191,8 +191,8 @@ def main():
                         except Exception as e:
                             print(f" failed to decrypt {enc_path}: {e}", file=sys.stderr)
 
-            except HTTPError as e:
-                if e.code != 404:
+            except (HTTPError, URLError) as e:
+                if not isinstance(e, HTTPError) or e.code != 404:
                     print(f" warning: could not fetch {node_id} node logs: {e}", file=sys.stderr)
 
     print(f"\nArtifacts written to: {out_dir}")

From 0cf260977ccf4ddf7979a2f74edfbb8cd667e921 Mon Sep 17 00:00:00 2001
From: Xav Paice
Date: Tue, 23 Jun 2026 12:29:31 +1200
Subject: [PATCH 5/5] add docs for .env

---
 AGENTS.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/AGENTS.md b/AGENTS.md
index 6d2e3c82cc..f3597995c4 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -179,6 +179,19 @@ tgrun queue --spec --os-spec --ref --api-token
 
 Add-ons can define their own specs under `addons//template/testgrid/*.yaml`. During add-on tests, `bin/test-addon.sh` substitutes `__testver__` and `__testdist__` placeholders in the spec templates.
 
+#### Testgrid credentials for local runs
+
+A few local helpers (for example, `bin/test-addon.sh` and the `testgrid-failure-analysis` skill) read Testgrid secrets from environment variables. Put these values in a `.env` file at the repository root (not `.envrc`) and keep the file out of version control. `.env` is already listed in `.gitignore`.
+
+If you use `direnv`, the `.envrc` in the repo root will automatically source `.env` when you enter the directory. Example variables to set in `.env` (replace `...` with your own secrets):
+
+```bash
+export TESTGRID_API_TOKEN=...
+export TESTGRID_AGE_PASSPHRASE=...
+```
+
+Never commit real values or `.env` files to the repository. The tracked `.envrc` only loads `.env` and must not contain secrets itself.
+
 ### How Testgrid works (high-level)
 
 1. A user or CI invokes `tgrun queue` with a test spec and OS spec.
@@ -242,6 +255,7 @@ make kurl-util-image # replicated/kurl-util:alpha
 - **Testgrid OS filtering is opt-out** via `unsupportedOSIDs`; there is no `supportedOSIDs`.
 - **Linux/amd64 required** for runtime testing. The scripts are not macOS-compatible; use a remote Linux VM or Docker.
 - **On Apple Silicon**, set `GOOS=linux`, `GOARCH=amd64`, and `DOCKER_DEFAULT_PLATFORM=linux/amd64` before building.
+- **Keep secrets in `.env`**. `.env` is gitignored; the tracked `.envrc` only loads it and must not contain secrets. If you use `direnv`, `.envrc` will automatically load `.env`; otherwise, source it manually. Never commit secrets or `.env`.
 
 ## 8. Useful commands