fix: safe opcache_reset#2073
Conversation
# Conflicts: # caddy/caddy_test.go
|
Hi @AlliBalliBaba, any news on this side? 🙂 |
|
I experimented a lot with juggling thread states, but not sure it's possible to make a reset fully safe from our side alone. There are 2 main race conditions:
The first race condition kind of gets mitigated by the changes in this PR. THe second one would need to somehow be fixed in php-src or requires a hook to reject opcache_resets. |
|
Can we try doing what we do with the environment functions and simply override/replace the existing one:
|
|
Yeah I tried that as well. But an But maybe overwriting the user function would still be a good first step 👍 |
I tried this and it kind of works, but workers are an issue. Only got it working by locking new requests, draining all workers, doing the reset, restarting all workers and then unlocking. |
|
Oh I just realized that overwriting |
|
Yes, I hard linked because opcache is always active in 8.5+ (which is where I test). I'll tidy up my changes and merge them on top of this branch when I get to it. |
|
Maybe we can make it a requirement also for other PHP versions, not sure how many installations there are without it. |
|
You already had logic to check whether opcache exists in the branch so I only had to pick in a minor change here. Let's see if tests pass. Edit: well, apparently not. I only tested with 8.5 locally and guess what's passing x) Okay, I think I get the issue now. Hard linking to opcache won't help because the init chain for shared extensions is different than for static extensions, which is why it works for me locally, but not in CI in < 8.5. Need to wait with actually calling opcache reset until the extension is properly loaded. |
e54cd96 to
5c8e340
Compare
5c8e340 to
1d75824
Compare
e7bd25a to
0d87765
Compare
|
Alright, we're down to php 8.2 segfaulting. I'm not terribly keep to find out why. |
|
Down to only debian 8.2 failing... |
I looked at the last 3 runs on the main branch and they were green with the test passing, so I thought it had to be connected to this branch. Extremely annoying, but kind of my own fault for not remembering them. I've removed the 8.2 specific workarounds here so it does the safe session reset for every php version. Only 9 months to go for 8.2 support anyway... |
| func RestartWorkers() { | ||
| // disallow scaling threads while restarting workers | ||
| scalingMu.Lock() | ||
| defer scalingMu.Unlock() | ||
|
|
||
| threadsToRestart := drainWorkerThreads() | ||
|
|
||
| for _, thread := range threadsToRestart { | ||
| thread.drainChan = make(chan struct{}) | ||
| thread.state.Set(state.Ready) | ||
| } | ||
| restartThreadsAndOpcacheReset(false) | ||
| } |
There was a problem hiding this comment.
I wonder if restartWorkers should also restart regular threads, I suspect that might be what's causing #2265
There was a problem hiding this comment.
I also want to try under which conditions opcache_reset is not strictly necessary with restart_workers, something for a different PR.
Signed-off-by: Marc <m@pyc.ac>
|
Final test for this branch would be the moodle installation from #1737 if you get around to try it. Otherwise I can try setting it up again. |
|
I won't have the time for that this week, would be great if you could pick it up! |
It no longer crashes and the container keeps running, but with so many opcache resets scheduled it times some requests out. |
henderkes
left a comment
There was a problem hiding this comment.
@dunglas @withinboredom @alexandre-daubois ready for review now.
|
I have yet to review the code more deeply but I can confirm this fixed my issue! |
Hmm requests timing out means there might be deadlock somewhere.. Maybe we can also wait for the logic from #2292 to completely reset all threads as an alternate approach. |
Oh no, I don't think so. I usually set timeout to 5s while testing. Not sure why the moodle installation takes so much time at the start, once it's installed nothing times out. I'll see if increasing timeout from a fresh installation still runs into the timeouts, but I don't think so. |
|
Hmm when increasing the number of requests to 1000 in the OpcacheTest I'm also getting the deadlock, I think you've discovered yet another race condition 😩. Not in our implementation, but probably opcache. At this point I think skipping opcache and doing a clean thread reset might be the better approach. |
|
The deadlock just came down to our tests. When an assertion failed it never called wg.Done(). I've increased the timeout to accommodate for the opcache resets + sleeps and it works now. |
PR php#2073 drains all threads for explicit opcache_reset() calls, but WordPress triggers implicit restarts via opcache_invalidate() filling opcache memory → zend_accel_schedule_restart() → restart_pending → the actual reset runs during the next php_request_startup(). That path bypasses the opcache_reset() override and races on shared memory. Add a pthread_rwlock around the request lifecycle: - Normal requests: read lock (concurrent, single atomic CAS) - When zend_accel_schedule_restart_hook fires: set an atomic flag - Next php_request_startup(): sees flag → write lock (exclusive), blocks until all current requests complete, resets safely - After startup: unlock, all threads resume Combined with PR php#2073, this covers both explicit opcache_reset() (thread drain) and implicit OOM/hash-overflow restarts (rwlock). Testing: without this patch, crashes every ~10 min on a WordPress multisite. With PR php#2073 alone, same crash frequency. With the rwlock from our PR php#2349 alone, crashes reduced to every 6-12 hours (worker threads not covered). Combined fix expected to eliminate both paths.
| numRequests := 1000 | ||
| wg.Add(numRequests) | ||
| for i := 0; i < numRequests; i++ { | ||
|
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR aims to make opcache_reset() safe under concurrency by funneling resets through a coordinated thread restart mechanism and ensuring only one actual opcache reset runs at a time.
Changes:
- Centralize worker draining/restarting into shared
drainThreads/restartThreadsAndOpcacheReset. - Introduce new state machine phases to coordinate opcache reset across threads.
- Add a concurrency-focused integration test and PHP fixtures for triggering
opcache_reset().
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| worker.go | Routes draining/restarting through new shared helpers. |
| threadworker.go | Moves opcache reset flow into a coordinated state sequence; removes direct reset call. |
| threadregular.go | Adds coordinated opcache reset flow for regular threads (and imports C). |
| internal/state/state.go | Adds new thread states for opcache reset coordination. |
| frankenphp.go | Implements global coordination: draining, restart, once-only reset, and C-export hook. |
| frankenphp.c | Overrides opcache_reset() to schedule a safe reset; changes reset implementation to call original handler. |
| testdata/require.php | Adds fixture for php require during opcache reset test. |
| testdata/opcache_reset.php | Adds fixture endpoint that calls opcache_reset(). |
| caddy/caddy_test.go | Adds concurrency test attempting to reproduce opcache reset segfault race. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| opcacheResetWg.Go(func() { | ||
| thread.state.WaitFor(state.OpcacheResettingDone) | ||
| }) |
|
|
||
| /* Forward declaration */ | ||
| PHP_FUNCTION(frankenphp_opcache_reset); | ||
| zif_handler orig_opcache_reset; |
| static void frankenphp_override_opcache_reset(void) { | ||
| zend_function *func = zend_hash_str_find_ptr( | ||
| CG(function_table), "opcache_reset", sizeof("opcache_reset") - 1); | ||
| if (func != NULL && func->type == ZEND_INTERNAL_FUNCTION && | ||
| ((zend_internal_function *)func)->handler != | ||
| ZEND_FN(frankenphp_opcache_reset)) { | ||
| orig_opcache_reset = ((zend_internal_function *)func)->handler; | ||
| ((zend_internal_function *)func)->handler = | ||
| ZEND_FN(frankenphp_opcache_reset); | ||
| } | ||
| } |
|
|
||
| threadsToRestart := drainThreads(withRegularThreads) | ||
|
|
||
| opcacheResetOnce = sync.Once{} |
| if threadsAreRestarting.CompareAndSwap(false, true) { | ||
| go restartThreadsAndOpcacheReset(true) | ||
| } |
| // introduce some random delay | ||
| if rand.IntN(10) > 8 { | ||
| time.Sleep(time.Millisecond * 10) | ||
| } | ||
|
|
||
| go func() { | ||
| defer wg.Done() | ||
| // randomly call opcache_reset | ||
| if rand.IntN(10) > 7 { |
If you don't mind tackling the copilot flagged issues, please. I'm working on a different project atm. |
Idea to fix #1737
Just a WIP, to test in CI.