feat(agent): use context.WithCancel to handle graceful shutdowns#613
feat(agent): use context.WithCancel to handle graceful shutdowns#613dajneem23 wants to merge 2 commits intouber:masterfrom
Conversation
…ling; add logging context utilities
| }() | ||
|
|
||
| go func() { | ||
| if err := nginx.Run(config.Nginx, map[string]interface{}{ |
There was a problem hiding this comment.
@sambhav-jain-16 Should we add a nginx.RunContext variant (accepting a context.Context) to support graceful shutdown ?
There was a problem hiding this comment.
Pull request overview
This PR attempts to make the agent daemon's startup/runtime path context-aware so fatal conditions and OS signals can trigger shutdown through agent/cmd.Run instead of calling log.Fatal from deep inside the command stack. It fits into the agent entrypoint and command runner, which are responsible for wiring up the agent HTTP server, registry, nginx, metrics, and supporting background work.
Changes:
- Add signal-aware root-context handling in
agent/main.goand switchcmd.Runto return an error. - Refactor
agent/cmd.Runto usecontext.WithCancel, replace many internallog.Fatalcalls with returned errors, and introducewaitForShutdown. - Add unit tests around shutdown coordination helpers and the HTTP server shutdown pattern.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
agent/main.go |
Wraps agent startup in a signal-aware context and handles cmd.Run errors at the process entrypoint. |
agent/cmd/cmd.go |
Refactors the agent runner to use cancelable context flow, background goroutines, and shutdown/error coordination. |
agent/cmd/cmd_test.go |
Adds tests for waitForShutdown behavior and the HTTP server shutdown guard. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| go func() { | ||
| <-ctx.Done() | ||
| if err := httpServer.Shutdown(context.Background()); err != nil { | ||
| log.Errorf("agent server shutdown: %s", err) |
There was a problem hiding this comment.
+1. Also, wouldn't it be better to use golang.org/x/sync/errgroup to run the goroutines this way you don't need to write a different function to wait for shutdown?
There was a problem hiding this comment.
That make sense, but I'm concerned about the registry.ListenAndServe() comes from github.com/docker/distribution and the Registry.server field is unexported with no public Shutdown() method. The only shutdown hook in that package is an internal SIGTERM listener. So errgroup.Wait() would block forever on the registry goroutine during a clean shutdown.
For Nginx, I added nginx.RunContext which sends SIGQUIT on context cancellation, so nginx drains gracefully. httpServer is also covered via the synchronous httpServer.Shutdown() call after waitForShutdown returns.
The registry is the last piece. I'd suggest we track that as a follow-up: either add a Shutdown wrapper to our docker/distribution fork or expose the *http.Server so we can call Shutdown on it externally. Once that's in place, we can replace all of this with a clean errgroup.
Would you be happy if I opened a follow-up issue for the registry shutdown?
| go func() { | ||
| if err := nginx.Run(config.Nginx, map[string]interface{}{ | ||
| "allowed_cidrs": config.AllowedCidrs, | ||
| "port": flags.AgentRegistryPort, | ||
| "registry_server": nginx.GetServer( |
| go func() { | ||
| if err := registry.ListenAndServe(); err != nil { | ||
| stopHeartbeat() | ||
| log.Fatal(err) | ||
| log.Errorf("registry exited: %s", err) | ||
| errCh <- err | ||
| cancel() |
| if err := ctx.Err(); err != nil && err != context.Canceled { | ||
| log.Fatalf("context error: %s", err) |
| go func() { | ||
| if err := nginx.Run(config.Nginx, map[string]interface{}{ | ||
| "allowed_cidrs": config.AllowedCidrs, | ||
| "port": flags.AgentRegistryPort, | ||
| "registry_server": nginx.GetServer( |
|
@sambhav-jain-16 to fix TestRingLocationsDistribution CI, should we increase sampleSize? |
sambhav-jain-16
left a comment
There was a problem hiding this comment.
Good start.
Some comments
| go func() { | ||
| if err := http.ListenAndServe(addr, agentServer.Handler()); err != nil { | ||
| if err := httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed { |
There was a problem hiding this comment.
We should cancel on all errors, right?
There was a problem hiding this comment.
From https://pkg.go.dev/net/http#Server.Shutdown
" Shutdown does not attempt to close nor wait for hijacked connections such as WebSockets. [...] When Shutdown is called, Serve, ListenAndServe, and ListenAndServeTLS immediately return ErrServerClosed. Make sure the program doesn't exit and waits instead for Shutdown to return."
http.ErrServerClosed is returned by ListenAndServe exactly when Shutdown() or Close() is called on the server. Here's the flow in this code:
- OS sends SIGTERM/SIGINT → signal.NotifyContext in main.go cancels the top-level ctx
- ctx.Done() fires → the shutdown goroutine calls httpServer.Shutdown(context.Background())
- Shutdown() drains in-flight requests, then causes ListenAndServe to return http.ErrServerClosed
| }() | ||
|
|
||
| go func() { | ||
| if err := nginx.Run(config.Nginx, map[string]interface{}{ |
| func main() { | ||
| cmd.Run(cmd.ParseFlags(), cmd.WithEffect(func() { | ||
| ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM) | ||
| defer stop() |
There was a problem hiding this comment.
We shouldn't change the exit code mechanism.
There was a problem hiding this comment.
We shouldn't change the exit code mechanism.
Agreed — and we didn't. The exit code mechanism is unchanged from master. log.Fatal(err) is still the last line in main, so any fatal error still terminates with exit code 1 via os.Exit(1) exactly as before.
cmd.Run needs a context. Context that is cancelled on SIGINT/SIGTERM.
cmd.Run now takes ctx—the context is passed down so internal goroutines (HTTP server, nginx) can shut down gracefully when a signal arrives, instead of being killed abruptly.
Is this acceptable?
There was a problem hiding this comment.
This should also be an error now, right rather than panic?
There was a problem hiding this comment.
This should also be an error now, right rather than panic?
make sense
| go func() { | ||
| <-ctx.Done() | ||
| if err := httpServer.Shutdown(context.Background()); err != nil { | ||
| log.Errorf("agent server shutdown: %s", err) |
There was a problem hiding this comment.
+1. Also, wouldn't it be better to use golang.org/x/sync/errgroup to run the goroutines this way you don't need to write a different function to wait for shutdown?
| addr := fmt.Sprintf(":%d", flags.AgentServerPort) | ||
| log.Infof("Starting agent server on %s", addr) | ||
| log.InfoS("starting agent server", "addr", addr) | ||
| errCh := make(chan error, 2) |
There was a problem hiding this comment.
Use of buffered channel is not recommended, especially in case of errors, I recommend we use errgoup
58508b9 to
6f60d13
Compare
What?
Replace the use of log.Fatal throughout the command runner with a context‑based cancellation mechanism, enabling the Run function to implement a proper “stopper” and perform graceful cleanup before exiting.
Fixes #510
Why?
Currently, any fatal error in cmd.Run or its sub‑functions calls log.Fatal, which immediately terminates the process with os.Exit(1). This prevents deferred cleanup (closing files, flushing logs, draining connections) and makes it impossible to implement a graceful shutdown routine.
The proposed pattern uses context.WithCancel to signal that the application should stop. The Run function can:
Pass the cancellable context to all downstream operations.
Cancel the context upon any fatal error (instead of calling log.Fatal).
Listen for context cancellation (<-ctx.Done()) and execute a controlled teardown sequence before allowing the process to exit normally.
How?
Checklist?
Test coverage