Commit 8c707e0
Harden job logging and timeout cleanup (#476)
## Summary
This PR hardens two failure paths that show up under resource pressure:
1. Per-job log files could be missing even though the job appeared in
the TUI.
2. Jobs could appear to run well past `job_timeout_minutes` because the
worker context timed out but the agent subprocess/pipe handling did not
always unwind promptly.
The primary goal here is reliability, not changing policy. The existing
default 30-minute timeout remains in place; this change makes that
timeout and per-job logging more dependable.
## What can go wrong today
### Scenario 1: log file creation fails when the job starts
Per-job logging was best-effort. We attempted to open
`~/.roborev/logs/jobs/<id>.log` once at job start. If that failed
because of a transient filesystem or resource problem, roborev logged a
warning and permanently disabled disk logging for that job.
On a resource-constrained machine, that can happen if:
- the machine is under RAM pressure and filesystem operations stall or
fail transiently
- the data directory is temporarily unavailable or in a bad state
- writes start failing mid-job and the log file is never reopened
The visible result is a job in the TUI with no corresponding job log
file, even though the job itself may finish.
### Scenario 2: the job timeout fires, but the review does not actually unwind
The worker already applies a hard timeout via
`context.WithTimeout(...)`, defaulting to 30 minutes. The bug is that
some agent subprocess paths could still remain stuck in process wait /
pipe handling after the context deadline.
That means a user can observe a review job apparently "running" for
longer than the configured timeout, especially when subprocesses are
slow to terminate or inherited pipes remain open.
On low-resource machines this is more likely because:
- child processes can stall while under memory pressure
- cleanup can take longer
- stdout pipe / background process behavior can keep `Wait()` from
returning promptly
## What this PR changes
### 1. Retryable job log writer
Replace the old one-shot `safeWriter` behavior with a retrying
`jobLogWriter`:
- it keeps trying to open/reopen the per-job log file after transient
failures
- it buffers a bounded amount of output in memory while disk logging is
unavailable
- it flushes buffered content once logging becomes available again
- if the buffer overflows, it records how many bytes were dropped
instead of silently disabling logging forever
This means a transient log-file open/write failure no longer dooms the
entire job to have no disk log.
### 2. Centralized subprocess wait configuration and timeout cleanup
Add shared subprocess helpers so all agent adapters use the same
wait-delay handling.
For streaming adapters, close the stdout pipe when the job context is
done. This helps break cases where the context deadline has fired but
the reader / wait path is still blocked behind lingering pipe state.
After `cmd.Wait()` completes, return `context.DeadlineExceeded` when the
job context has expired instead of surfacing a generic subprocess error.
### 3. Explicit timeout handling in the worker
When an agent returns a deadline error, the worker now records a stable
job error like:
`agent timeout after 30m0s`
and sends it through the normal retry/failover path.
This makes timeout failures clearer in the database, TUI, and hooks, and
avoids jobs looking like ambiguous generic agent failures.
## Why this fixes the reported behavior
The reported symptoms were:
- not every displayed job had a log file
- some review jobs seemed to hang for over an hour
- the machine had limited RAM
This PR directly addresses those failure modes:
- transient resource issues no longer permanently disable per-job disk
logging
- timed-out agent jobs are much more likely to unwind promptly instead
of lingering past the configured timeout
## Validation
- `go fmt ./...`
- `go vet ./...`
- `go test ./...`
## Notes from local investigation
On the investigated machine there are more jobs in `reviews.db` than
files in `~/.roborev/logs/jobs`, but the mismatch is historical rather
than current: jobs from the last 7 days all had logs. The older missing
logs line up with the existing log-retention/cleanup behavior and prior
best-effort logging behavior. This PR is aimed at preventing new gaps
under transient resource pressure and making timeout enforcement behave
consistently.
1 parent e294ff9 commit 8c707e0
16 files changed
Lines changed: 747 additions & 78 deletions