
docs: document CausalForestDML reproducibility requirements (closes #971, #962) #1035

Open
immu4989 wants to merge 1 commit into py-why:main from immu4989:docs-causalforestdml-reproducibility

@immu4989 immu4989 commented Jun 6, 2026

Contributor

Closes #971 and #962.

Motivation

Two open user issues report nondeterministic CausalForestDML outputs:

  • #962: "Despite setting fixed parameters, yet the CausalForestDML gets different results each time it is run? Tried setting a random seed, which also failed to effectively solve the problem."
  • #971: "we get different prediction results every time we run the causal forest model... The model was pre-trained, saved and then loaded for predictions."

I verified on main that CausalForestDML is fully reproducible — across repeated effect/effect_interval calls, pickle round-trips, and independent re-fits — when an integer random_state is passed at construction. I tested 8 configurations (n_jobs ∈ {None, 1, -1, 4}, 'auto' first-stage, discrete treatment, varied cv); all show max|Δ| = 0.

The two reporters' likely root causes:

  1. Setting only numpy.random.seed(...) globally instead of passing random_state=<int> to the estimator. The estimator owns its own seeded RNG path and does not consult the numpy global seed.
  2. Constructing without random_state set (default None), in which case results are intentionally stochastic across fits.

Both are gaps in the documented reproducibility requirements, not code bugs.

What changed

Adds a Notes -> Reproducibility section to the CausalForestDML docstring, immediately before Examples, following the numpydoc section-ordering convention. The note is anchored next to the existing random_state parameter documentation so users see both at once.
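
For orientation, here is a rough sketch of where such a note sits under the numpydoc ordering (Parameters, then Notes, then Examples). The class `CausalForestDMLSketch` is a hypothetical stand-in, and the note wording is illustrative only; it is not the text added by this PR.

```python
class CausalForestDMLSketch:
    """Hypothetical stand-in used only to show docstring section order.

    Parameters
    ----------
    random_state : int, RandomState instance, or None, default=None
        Seed for the estimator's random number generation.

    Notes
    -----
    **Reproducibility.** Pass an integer ``random_state`` at construction
    to get identical results across fits. Calling ``numpy.random.seed``
    alone is not sufficient, and with ``random_state=None`` results vary
    from fit to fit.

    Examples
    --------
    >>> est = CausalForestDMLSketch(random_state=123)
    """

    def __init__(self, random_state=None):
        self.random_state = random_state


# Confirm the section order in the rendered docstring text.
doc = CausalForestDMLSketch.__doc__
print(doc.index("Parameters") < doc.index("Notes") < doc.index("Examples"))
```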

Scope question

The same reproducibility requirement applies to the other DML estimators (LinearDML, SparseLinearDML, NonParamDML) that thread random_state through _make_first_stage_selector. I kept this PR scoped to CausalForestDML since both open issues specifically reference it. Happy to add the same note to the other DML docstrings in this PR (or a follow-up) if you'd prefer broader coverage.

…, py-why#962)

Two open issues report nondeterministic CausalForestDML predictions
across runs (py-why#962) and across save/load cycles (py-why#971). Both can be
resolved by passing an integer random_state at construction; setting
numpy.random.seed alone is not sufficient because the estimator uses
its own seeded RNG for cross-fitting splits and forest subsampling.

Add a Reproducibility note under a new Notes section in the docstring
so the requirement is discoverable next to the existing random_state
parameter documentation, before the Examples block.

Signed-off-by: Imran Ahamed <immu4989@gmail.com>
