Harden job logging and timeout cleanup (#476)

wesm · web-flow · commit 8c707e0be53a · 2026-03-08T10:35:06.000-05:00
## Summary

This PR hardens two failure paths that show up under resource pressure:

1. Per-job log files could be missing even though the job appeared in
the TUI.
2. Jobs could appear to run well past `job_timeout_minutes` because the
worker context timed out but the agent subprocess/pipe handling did not
always unwind promptly.

The primary goal here is reliability, not changing policy. The existing
default 30 minute timeout remains in place; this change makes that
timeout and per-job logging more dependable.

## What can go wrong today

### Scenario 1: log file creation fails when the job starts

Per-job logging was best-effort. We attempted to open
`~/.roborev/logs/jobs/<id>.log` once at job start. If that failed
because of a transient filesystem or resource problem, roborev logged a
warning and permanently disabled disk logging for that job.

On a resource-constrained machine, that can happen if:
- the machine is under RAM pressure and filesystem operations stall or
fail transiently
- the data directory is temporarily unavailable or in a bad state
- writes start failing mid-job and the log file is never reopened

The visible result is a job in the TUI with no corresponding job log
file, even though the job itself may finish.

### Scenario 2: the job timeout fires, but the review does not actually unwind

The worker already applies a hard timeout via
`context.WithTimeout(...)`, defaulting to 30 minutes. The bug is that
some agent subprocess paths could still remain stuck in process wait /
pipe handling after the context deadline.

That means a user can observe a review job apparently "running" for
longer than the configured timeout, especially when subprocesses are
slow to terminate or inherited pipes remain open.

On low-resource machines this is more likely because:
- child processes can stall while under memory pressure
- cleanup can take longer
- stdout pipe / background process behavior can keep `Wait()` from
returning promptly

## What this PR changes

### 1. Retryable job log writer

Replace the old one-shot `safeWriter` behavior with a retrying
`jobLogWriter`:
- it keeps trying to open/reopen the per-job log file after transient
failures
- it buffers a bounded amount of output in memory while disk logging is
unavailable
- it flushes buffered content once logging becomes available again
- if the buffer overflows, it records how many bytes were dropped
instead of silently disabling logging forever

This means a transient log-file open/write failure no longer dooms the
entire job to have no disk log.

### 2. Centralized subprocess wait configuration and timeout cleanup

Add shared subprocess helpers so all agent adapters use the same
wait-delay handling.

For streaming adapters, close the stdout pipe when the job context is
done. This helps break cases where the context deadline has fired but
the reader / wait path is still blocked behind lingering pipe state.

After `cmd.Wait()` completes, return `context.DeadlineExceeded` when the
job context has expired instead of surfacing a generic subprocess error.

### 3. Explicit timeout handling in the worker

When an agent returns a deadline error, the worker now records a stable
job error like:

`agent timeout after 30m0s`

and sends it through the normal retry/failover path.

This makes timeout failures clearer in the database, TUI, and hooks, and
avoids jobs looking like ambiguous generic agent failures.

## Why this fixes the reported behavior

The reported symptoms were:
- not every displayed job had a log file
- some review jobs seemed to hang for over an hour
- the machine had limited RAM

This PR directly addresses those failure modes:
- transient resource issues no longer permanently disable per-job disk
logging
- timed-out agent jobs are much more likely to unwind promptly instead
of lingering past the configured timeout

## Validation

- `go fmt ./...`
- `go vet ./...`
- `go test ./...`

## Notes from local investigation

On the investigated machine there are more jobs in `reviews.db` than
files in `~/.roborev/logs/jobs`, but the mismatch is historical rather
than current: jobs from the last 7 days all had logs. The older missing
logs line up with the existing log-retention/cleanup behavior and prior
best-effort logging behavior. This PR is aimed at preventing new gaps
under transient resource pressure and making timeout enforcement behave
consistently.
diff --git a/internal/agent/claude.go b/internal/agent/claude.go
@@ -12,7 +12,6 @@ import (
 	"slices"
 	"strings"
 	"sync"
-	"time"
 )
 
 // ClaudeAgent runs code reviews using Claude Code CLI
@@ -135,7 +134,7 @@ func (a *ClaudeAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
-	cmd.WaitDelay = 5 * time.Second
+	tracker := configureSubprocess(cmd)
 
 	// Strip CLAUDECODE to prevent nested-session detection (#270),
 	// and handle API key (configured key or subscription auth).
@@ -153,6 +152,8 @@ func (a *ClaudeAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 	if err != nil {
 		return "", fmt.Errorf("create stdout pipe: %w", err)
 	}
+	stopClosingPipe := closeOnContextDone(ctx, stdoutPipe)
+	defer stopClosingPipe()
 	cmd.Stderr = &stderr
 
 	// Always pipe prompt via stdin (stream-json mode)
@@ -166,6 +167,9 @@ func (a *ClaudeAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 	result, err := parseStreamJSON(stdoutPipe, output)
 
 	if waitErr := cmd.Wait(); waitErr != nil {
+		if ctxErr := contextProcessError(ctx, tracker, waitErr, err); ctxErr != nil {
+			return "", ctxErr
+		}
 		// Build a detailed error including any partial output and stream errors
 		var detail strings.Builder
 		fmt.Fprintf(&detail, "%s failed", a.Name())
@@ -186,6 +190,10 @@ func (a *ClaudeAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 		return "", fmt.Errorf("%s: %w", detail.String(), waitErr)
 	}
 
+	if ctxErr := contextProcessError(ctx, tracker, nil, err); ctxErr != nil {
+		return "", ctxErr
+	}
+
 	if err != nil {
 		return "", err
 	}
diff --git a/internal/agent/codex.go b/internal/agent/codex.go
@@ -11,7 +11,6 @@ import (
 	"os/exec"
 	"strings"
 	"sync"
-	"time"
 )
 
 // CodexAgent runs code reviews using the Codex CLI
@@ -196,7 +195,7 @@ func (a *CodexAgent) Review(ctx context.Context, repoPath, commitSHA, prompt str
 
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
-	cmd.WaitDelay = 5 * time.Second
+	tracker := configureSubprocess(cmd)
 
 	// Pipe prompt via stdin to avoid command line length limits on Windows.
 	// Windows has a ~32KB limit on command line arguments, which large diffs easily exceed.
@@ -210,6 +209,8 @@ func (a *CodexAgent) Review(ctx context.Context, repoPath, commitSHA, prompt str
 	if err != nil {
 		return "", fmt.Errorf("create stdout pipe: %w", err)
 	}
+	stopClosingPipe := closeOnContextDone(ctx, stdoutPipe)
+	defer stopClosingPipe()
 	// Tee stderr to output writer for live error visibility
 	if sw != nil {
 		cmd.Stderr = io.MultiWriter(&stderr, sw)
@@ -225,12 +226,19 @@ func (a *CodexAgent) Review(ctx context.Context, repoPath, commitSHA, prompt str
 	result, parseErr := a.parseStreamJSON(stdoutPipe, sw)
 
 	if waitErr := cmd.Wait(); waitErr != nil {
+		if ctxErr := contextProcessError(ctx, tracker, waitErr, parseErr); ctxErr != nil {
+			return "", ctxErr
+		}
 		if parseErr != nil {
 			return "", fmt.Errorf("codex failed: %w (parse error: %v)\nstderr: %s", waitErr, parseErr, stderr.String())
 		}
 		return "", fmt.Errorf("codex failed: %w\nstderr: %s", waitErr, stderr.String())
 	}
 
+	if ctxErr := contextProcessError(ctx, tracker, nil, parseErr); ctxErr != nil {
+		return "", ctxErr
+	}
+
 	if parseErr != nil {
 		if errors.Is(parseErr, errNoCodexJSON) {
 			return "", fmt.Errorf("codex CLI did not emit valid --json events; upgrade codex or check CLI compatibility: %w", errNoCodexJSON)
diff --git a/internal/agent/codex_test.go b/internal/agent/codex_test.go
@@ -7,6 +7,7 @@ import (
 	"slices"
 	"strings"
 	"testing"
+	"time"
 )
 
 func setupMockCodex(t *testing.T, unsafe bool, opts MockCLIOpts) (*CodexAgent, *MockCLIResult) {
@@ -116,6 +117,39 @@ func TestCodexReviewAlwaysAddsAutoApprove(t *testing.T) {
 	}
 }
 
+func TestCodexReviewTimeoutClosesStdoutPipe(t *testing.T) {
+	skipIfWindows(t)
+
+	prevWaitDelay := subprocessWaitDelay
+	subprocessWaitDelay = 20 * time.Millisecond
+	t.Cleanup(func() {
+		subprocessWaitDelay = prevWaitDelay
+	})
+
+	cmdPath := writeTempCommand(t, `#!/bin/sh
+if [ "$1" = "--help" ]; then
+  echo "usage `+codexAutoApproveFlag+`"
+  exit 0
+fi
+(sleep 2) &
+printf '%s\n' '{"type":"item.completed","item":{"type":"agent_message","text":"partial"}}'
+exit 0
+`)
+
+	a := NewCodexAgent(cmdPath)
+	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
+	defer cancel()
+
+	start := time.Now()
+	_, err := a.Review(ctx, t.TempDir(), "deadbeef", "prompt", nil)
+	if !errors.Is(err, context.DeadlineExceeded) {
+		t.Fatalf("expected context deadline exceeded, got %v", err)
+	}
+	if elapsed := time.Since(start); elapsed > time.Second {
+		t.Fatalf("Review hung for %v after timeout", elapsed)
+	}
+}
+
 func TestCodexParseStreamJSON(t *testing.T) {
 	a := NewCodexAgent("codex")
 
diff --git a/internal/agent/copilot.go b/internal/agent/copilot.go
@@ -85,6 +85,7 @@ func (a *CopilotAgent) Review(ctx context.Context, repoPath, commitSHA, prompt s
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Stdin = strings.NewReader(prompt)
 	cmd.Dir = repoPath
+	tracker := configureSubprocess(cmd)
 
 	var stdout, stderr bytes.Buffer
 	if sw := newSyncWriter(output); sw != nil {
@@ -96,6 +97,9 @@ func (a *CopilotAgent) Review(ctx context.Context, repoPath, commitSHA, prompt s
 	}
 
 	if err := cmd.Run(); err != nil {
+		if ctxErr := contextProcessError(ctx, tracker, err, nil); ctxErr != nil {
+			return "", ctxErr
+		}
 		return "", fmt.Errorf("copilot failed: %w\nstderr: %s", err, stderr.String())
 	}
 
diff --git a/internal/agent/cursor.go b/internal/agent/cursor.go
@@ -8,7 +8,6 @@ import (
 	"os"
 	"os/exec"
 	"strings"
-	"time"
 )
 
 // CursorAgent runs code reviews using the Cursor agent CLI
@@ -112,14 +111,16 @@ func (a *CursorAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
 	cmd.Env = os.Environ()
-	cmd.WaitDelay = 5 * time.Second
+	tracker := configureSubprocess(cmd)
 	cmd.Stdin = strings.NewReader(prompt)
 
 	var stderr bytes.Buffer
 	stdoutPipe, err := cmd.StdoutPipe()
 	if err != nil {
 		return "", fmt.Errorf("create stdout pipe: %w", err)
 	}
+	stopClosingPipe := closeOnContextDone(ctx, stdoutPipe)
+	defer stopClosingPipe()
 	cmd.Stderr = &stderr
 
 	if err := cmd.Start(); err != nil {
@@ -130,12 +131,19 @@ func (a *CursorAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 	result, err := a.parseStreamJSON(stdoutPipe, output)
 
 	if waitErr := cmd.Wait(); waitErr != nil {
+		if ctxErr := contextProcessError(ctx, tracker, waitErr, err); ctxErr != nil {
+			return "", ctxErr
+		}
 		if err != nil {
 			return "", fmt.Errorf("cursor agent failed: %w (parse error: %v)\nstderr: %s", waitErr, err, stderr.String())
 		}
 		return "", fmt.Errorf("cursor agent failed: %w\nstderr: %s", waitErr, stderr.String())
 	}
 
+	if ctxErr := contextProcessError(ctx, tracker, nil, err); ctxErr != nil {
+		return "", ctxErr
+	}
+
 	if err != nil {
 		return "", err
 	}
diff --git a/internal/agent/droid.go b/internal/agent/droid.go
@@ -104,6 +104,7 @@ func (a *DroidAgent) Review(ctx context.Context, repoPath, commitSHA, prompt str
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
 	cmd.Stdin = strings.NewReader(prompt)
+	tracker := configureSubprocess(cmd)
 
 	var stdout, stderr bytes.Buffer
 	cmd.Stdout = &stdout
@@ -114,6 +115,9 @@ func (a *DroidAgent) Review(ctx context.Context, repoPath, commitSHA, prompt str
 	}
 
 	if err := cmd.Run(); err != nil {
+		if ctxErr := contextProcessError(ctx, tracker, err, nil); ctxErr != nil {
+			return "", ctxErr
+		}
 		return "", fmt.Errorf("droid failed: %w\nstderr: %s", err, stderr.String())
 	}
 
diff --git a/internal/agent/gemini.go b/internal/agent/gemini.go
@@ -10,7 +10,6 @@ import (
 	"io"
 	"os/exec"
 	"strings"
-	"time"
 )
 
 // errNoStreamJSON indicates no valid stream-json events were parsed.
@@ -117,7 +116,7 @@ func (a *GeminiAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
-	cmd.WaitDelay = 5 * time.Second
+	tracker := configureSubprocess(cmd)
 
 	// Pipe prompt via stdin
 	cmd.Stdin = strings.NewReader(prompt)
@@ -130,6 +129,8 @@ func (a *GeminiAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 	if err != nil {
 		return "", fmt.Errorf("create stdout pipe: %w", err)
 	}
+	stopClosingPipe := closeOnContextDone(ctx, stdoutPipe)
+	defer stopClosingPipe()
 	// Tee stderr to output writer for live error visibility
 	if sw != nil {
 		cmd.Stderr = io.MultiWriter(&stderr, sw)
@@ -145,12 +146,19 @@ func (a *GeminiAgent) Review(ctx context.Context, repoPath, commitSHA, prompt st
 	parsed, parseErr := a.parseStreamJSON(stdoutPipe, sw)
 
 	if waitErr := cmd.Wait(); waitErr != nil {
+		if ctxErr := contextProcessError(ctx, tracker, waitErr, parseErr); ctxErr != nil {
+			return "", ctxErr
+		}
 		if parseErr != nil {
 			return "", fmt.Errorf("gemini failed: %w (parse error: %v)\nstderr: %s", waitErr, parseErr, truncateStderr(stderr.String()))
 		}
 		return "", fmt.Errorf("gemini failed: %w\nstderr: %s", waitErr, truncateStderr(stderr.String()))
 	}
 
+	if ctxErr := contextProcessError(ctx, tracker, nil, parseErr); ctxErr != nil {
+		return "", ctxErr
+	}
+
 	if parseErr != nil {
 		if errors.Is(parseErr, errNoStreamJSON) {
 			return "", fmt.Errorf("gemini CLI must support --output-format stream-json; upgrade to latest version\nstderr: %s: %w", truncateStderr(stderr.String()), errNoStreamJSON)
diff --git a/internal/agent/kilo.go b/internal/agent/kilo.go
@@ -106,6 +106,7 @@ func (a *KiloAgent) Review(
 	cmd := exec.CommandContext(ctx, a.Command, a.buildArgs()...)
 	cmd.Dir = repoPath
 	cmd.Stdin = strings.NewReader(prompt)
+	tracker := configureSubprocess(cmd)
 
 	sw := newSyncWriter(output)
 
@@ -120,6 +121,8 @@ func (a *KiloAgent) Review(
 	if err != nil {
 		return "", fmt.Errorf("create stdout pipe: %w", err)
 	}
+	stopClosingPipe := closeOnContextDone(ctx, stdoutPipe)
+	defer stopClosingPipe()
 
 	if err := cmd.Start(); err != nil {
 		return "", fmt.Errorf("start kilo: %w", err)
@@ -140,6 +143,9 @@ func (a *KiloAgent) Review(
 	_, _ = io.Copy(&stdoutRaw, stdoutPipe)
 
 	if waitErr := cmd.Wait(); waitErr != nil {
+		if ctxErr := contextProcessError(ctx, tracker, waitErr, parseErr); ctxErr != nil {
+			return "", ctxErr
+		}
 		var detail strings.Builder
 		fmt.Fprintf(&detail, "kilo failed")
 		if parseErr != nil {
@@ -165,6 +171,10 @@ func (a *KiloAgent) Review(
 		return "", fmt.Errorf("%s: %w", detail.String(), waitErr)
 	}
 
+	if ctxErr := contextProcessError(ctx, tracker, nil, parseErr); ctxErr != nil {
+		return "", ctxErr
+	}
+
 	if parseErr != nil {
 		return result, parseErr
 	}
diff --git a/internal/agent/kiro.go b/internal/agent/kiro.go
@@ -8,7 +8,6 @@ import (
 	"os"
 	"os/exec"
 	"strings"
-	"time"
 )
 
 // maxPromptArgLen is a conservative limit for passing prompts as
@@ -150,7 +149,7 @@ func (a *KiroAgent) Review(ctx context.Context, repoPath, commitSHA, prompt stri
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
 	cmd.Env = os.Environ()
-	cmd.WaitDelay = 5 * time.Second
+	tracker := configureSubprocess(cmd)
 
 	// kiro-cli emits ANSI terminal escape codes that are not
 	// suitable for streaming. Capture and return stripped text.
@@ -159,6 +158,9 @@ func (a *KiroAgent) Review(ctx context.Context, repoPath, commitSHA, prompt stri
 	cmd.Stderr = &stderr
 
 	if err := cmd.Run(); err != nil {
+		if ctxErr := contextProcessError(ctx, tracker, err, nil); ctxErr != nil {
+			return "", ctxErr
+		}
 		return "", fmt.Errorf(
 			"kiro failed: %w\nstderr: %s",
 			err, stderr.String(),
diff --git a/internal/agent/opencode.go b/internal/agent/opencode.go
@@ -93,6 +93,7 @@ func (a *OpenCodeAgent) Review(
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
 	cmd.Stdin = strings.NewReader(prompt)
+	tracker := configureSubprocess(cmd)
 
 	// Share a single syncWriter so stdout and stderr writes
 	// to the output writer are serialized by one mutex.
@@ -109,6 +110,8 @@ func (a *OpenCodeAgent) Review(
 	if err != nil {
 		return "", fmt.Errorf("create stdout pipe: %w", err)
 	}
+	stopClosingPipe := closeOnContextDone(ctx, stdoutPipe)
+	defer stopClosingPipe()
 
 	if err := cmd.Start(); err != nil {
 		return "", fmt.Errorf("start opencode: %w", err)
@@ -124,6 +127,9 @@ func (a *OpenCodeAgent) Review(
 	_, _ = io.Copy(io.Discard, stdoutPipe)
 
 	if waitErr := cmd.Wait(); waitErr != nil {
+		if ctxErr := contextProcessError(ctx, tracker, waitErr, parseErr); ctxErr != nil {
+			return "", ctxErr
+		}
 		var detail strings.Builder
 		fmt.Fprintf(&detail, "opencode failed")
 		if parseErr != nil {
@@ -142,6 +148,10 @@ func (a *OpenCodeAgent) Review(
 		return "", fmt.Errorf("%s: %w", detail.String(), waitErr)
 	}
 
+	if ctxErr := contextProcessError(ctx, tracker, nil, parseErr); ctxErr != nil {
+		return "", ctxErr
+	}
+
 	if parseErr != nil {
 		return result, parseErr
 	}
diff --git a/internal/agent/pi.go b/internal/agent/pi.go
@@ -158,6 +158,7 @@ func (a *PiAgent) Review(
 
 	cmd := exec.CommandContext(ctx, a.Command, args...)
 	cmd.Dir = repoPath
+	tracker := configureSubprocess(cmd)
 
 	// Capture stdout for the result
 	var stdoutBuf bytes.Buffer
@@ -174,6 +175,9 @@ func (a *PiAgent) Review(
 	}
 
 	if err := cmd.Run(); err != nil {
+		if ctxErr := contextProcessError(ctx, tracker, err, nil); ctxErr != nil {
+			return "", ctxErr
+		}
 		return "", fmt.Errorf("pi failed: %w\nstderr: %s", err, stderrBuf.String())
 	}
 
diff --git a/internal/agent/process.go b/internal/agent/process.go
diff --git a/internal/agent/process_test.go b/internal/agent/process_test.go
diff --git a/internal/daemon/joblog.go b/internal/daemon/joblog.go
diff --git a/internal/daemon/joblog_test.go b/internal/daemon/joblog_test.go
diff --git a/internal/daemon/worker.go b/internal/daemon/worker.go

`119`	`123`