Skip to content

Fix/3849: defer state publish in RunnableLoadJob until _on_completed marke…#3854

Open
ugbotueferhire wants to merge 4 commits intodlt-hub:develfrom
ugbotueferhire:devel
Open

Fix/3849: defer state publish in RunnableLoadJob until _on_completed marke…#3854
ugbotueferhire wants to merge 4 commits intodlt-hub:develfrom
ugbotueferhire:devel

Conversation

@ugbotueferhire
Copy link
Copy Markdown

[Fix] – Eliminate race condition between worker state publish and package finalization in RunnableLoadJob

Description

Context:
A race condition existed in RunnableLoadJob.run_managed() where self._state was published to a terminal value ("completed" or "failed") before the _on_completed callback had finished persisting the pending-transition marker to disk. This created a window where the main loader thread could observe the terminal state via job.state(), proceed to finalize the load package (renaming its directory), and collide with the worker thread still writing the transition marker inside the now-renamed directory. The result was intermittent filesystem errors or silently lost transition markers, potentially compromising data integrity on resume. See #3849.

Approach:
The fix introduces a local terminal_state variable in run_managed() that tracks the intended terminal state throughout execution and the finally block. The key behavioral change is:

  • Deferred state publication: self._state is only assigned after _on_completed() has finished persisting the marker and _finished_at has been set. This ensures the main thread cannot observe a terminal state until all side effects are complete.
  • Semaphore release moved outside the retry guard: The _done_event.release() call is now unconditionally placed after self._state is published, but still gated on non-retry terminal states. This preserves the existing wake-up semantics while guaranteeing correct ordering.
  • No new locks or synchronization primitives: The fix is purely a reordering of existing operations, keeping the change minimal and non-invasive.

Modified file:

  • dlt/common/destination/client.pyRunnableLoadJob.run_managed() method

Execution ordering (before → after):

  # BEFORE (racy)                        # AFTER (safe)
  try:                                    try:
-     self._state = "completed"           +     terminal_state = "completed"
  except ...:                             except ...:
-     self._state = "failed"              +     terminal_state = "failed"
  finally:                                finally:
-     if self._state != "retry":          +     if terminal_state != "retry":
-         self._on_completed(...)          +         self._on_completed(...)
-         self._finished_at = now()        +         self._finished_at = now()
-         self._done_event.release()       +     self._state = terminal_state    # ← publish AFTER marker is safe
+                                          +     if terminal_state != "retry":
+                                          +         self._done_event.release()

Impact

  • Functionality: The main loader thread can no longer finalize a load package while a worker is still writing the transition marker. Resume-after-crash is now fully reliable.
  • Data Integrity: Eliminates the possibility of lost or corrupt transition markers caused by directory renames racing against writes.
  • Performance: Zero overhead — the fix is a pure reordering of assignments, no new locks, no new allocations.
  • Backwards Compatibility: No public API changes. The observable behavior (state transitions, callback ordering, semaphore semantics) is identical; only the internal timing is tightened.

Tests

  • All existing tests pass
  • Dedicated race-condition-aware tests validate the fix

Unit Tests (tests/load/test_jobs.py) — 24/24 PASSED

Test What it validates
test_on_completed_called_on_success _on_completed fires with ("completed", None) on success
test_on_completed_called_on_failure _on_completed fires with ("failed", traceback) on terminal exception
test_on_completed_not_called_on_retry _on_completed is never called on transient exceptions
test_on_completed_exception_propagates Exceptions raised inside _on_completed are not swallowed
test_on_completed_fires_before_semaphore_release _on_completed executes before _finished_at is set and before the semaphore is released — directly validates the ordering guarantee this fix provides
test_runnable_job_results End-to-end state transitions for success, retry, and failure paths
test_set_final_state_* set_final_state() correctly forces terminal states

Integration Tests (tests/load/test_dummy_client.py) — 51/52 PASSED

All functional loader-loop tests pass, including:

Test What it validates
test_completed_loop Full load lifecycle completes correctly
test_resume_with_pending_completed_transition Resume correctly skips re-execution when marker exists
test_resume_with_pending_failed_transition Resume handles failed-state markers
test_load_single_thread Single-threaded loader path
test_load_multiple_packages Multi-package loading
test_spool_job_retry_* Retry mechanics remain correct

VISUALS

image image

Note

test_big_load_packages (1/52 failure) is a pre-existing performance benchmark that expects 1000 empty jobs to complete in under 15 seconds. It timed out at 37s due to local machine load — this is environment-specific and unrelated to this change.

@ugbotueferhire ugbotueferhire changed the title Fix: defer state publish in RunnableLoadJob until _on_completed marke… Fix/3849: defer state publish in RunnableLoadJob until _on_completed marke… Apr 11, 2026
@ugbotueferhire
Copy link
Copy Markdown
Author

ugbotueferhire commented Apr 11, 2026

Hi @Travior , could you review these changes when you have some time? Happy to make any adjustments needed. Thanks for your time!

Copy link
Copy Markdown
Contributor

@Travior Travior left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @ugbotueferhire thanks for opening a PR for this.
The main changes look very good, the comments are a bit verbose but not the end of the world.

There are two main concerns that I have:

  1. the changes look to correctly solve the race condition, but we should probably introduce a regression test that reproduces this ordering deterministically. I would look at tests/load/test_jobs.py::test_on_completed_fires_before_semaphore_release and tests/common/configuration/test_container.py::test_container_thread_affinity on how to do tests around load_jobs and how to use threading.Barrier
  2. right now if _on_completed raises (for example when the .dlt folder becomes unavailable) the job stays in a non-terminal state. We need to set a terminal state without swallowing the exception, otherwise execution will halt indefinitely. There is no test that checks this right now, which would be a great addition too

If there is anything that is unclear or you need further help feel free to ping me, thanks again for your work!

@ugbotueferhire
Copy link
Copy Markdown
Author

Hey @Travior, thanks for the feedback 🤙🏾 I've addressed both concerns:

  1. Regression test: Added test_state_not_published_before_on_completed_finishes — uses threading.Barrier (following the test_container_thread_affinity pattern) to deterministically verify that state() stays "running" while _on_completed is executing.

  2. _on_completed exception safety: Wrapped the callback in try/except so if it raises (e.g. .dlt folder unavailable), the job is forced to "failed", _finished_at is set, and the semaphore is released before re-raising. Added two tests for this: test_on_completed_exception_sets_terminal_state and test_on_completed_exception_releases_semaphore. Also updated test_on_completed_exception_propagates to assert terminal state.

All 27 tests in test_jobs.py pass.

@Travior
Copy link
Copy Markdown
Contributor

Travior commented Apr 28, 2026

Thanks for coming back to this. I'll have a look at the tests tomorrow and will come back to you

@ugbotueferhire
Copy link
Copy Markdown
Author

Hey @Travior, reminder incase you forgot, Please Whenever you have a moment, would love to get this over the line. Thanks!

@Travior
Copy link
Copy Markdown
Contributor

Travior commented May 3, 2026

This looks quite good now.
Just one note: When _on_completed raises we override self._exception (even though this might already be populated if the load has failed). We should probably keep the "more valuable" exception from the Destination in this double failure case.
Otherwise this looks really good, tests are good

@ugbotueferhire
Copy link
Copy Markdown
Author

ugbotueferhire commented May 4, 2026

Thanks for flagging this. just updated _on_completed so we only store the callback error if the job hadn’t already failed, which means we now keep the original destination exception in the double-failure case. I also added a regression test for that path.

@Travior
Copy link
Copy Markdown
Contributor

Travior commented May 4, 2026

Thanks a lot, I'll let CI run and check in on this tomorrow morning!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

package can be finalized/renamed before job _on_completed(...) finishes writing transition marker

2 participants