docs: document CausalForestDML reproducibility requirements (closes #971, #962) by immu4989 · Pull Request #1035 · py-why/EconML

immu4989 · 2026-06-06T00:29:43Z

Closes #971 and #962.

Motivation

Two open user issues report nondeterministic CausalForestDML outputs:

#962: "Despite setting fixed parameters, yet the CausalForestDML gets different results each time it is run? Tried setting a random seed, which also failed to effectively solve the problem."
#971: "we get different prediction results every time we run the causal forest model... The model was pre-trained, saved and then loaded for predictions."

I verified on main that CausalForestDML is fully reproducible when an integer random_state is passed at construction. This holds across repeated effect/effect_interval calls, pickle round-trips, and independent re-fits. I tested 8 configurations (n_jobs ∈ {None, 1, -1, 4}, 'auto' first-stage, discrete treatment, varied cv); all show max|Δ| = 0.

The two reporters' likely root causes:

Setting only numpy.random.seed(...) globally instead of passing random_state=<int> to the estimator. The estimator owns its own seeded RNG path and does not consult the numpy global seed.
Constructing the estimator without random_state (the default is None), in which case results are intentionally stochastic across fits.

Both are best treated as a missing documented reproducibility requirement, not as code bugs.
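The two failure modes and the checks described above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: ToyEstimator is a hypothetical stand-in, not EconML code, and Python's standard-library random module plays the role that numpy.random plays in the reports. It only mirrors the pattern this PR describes, where the estimator draws from its own RNG seeded by random_state, so a global seed does not reach it.

```python
import pickle
import random


class ToyEstimator:
    """Hypothetical stand-in for an estimator with a random_state parameter."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, data):
        # Private RNG: seeded from random_state, or from OS entropy when it is None.
        # The module-level random.seed(...) never reaches this generator.
        rng = random.Random(self.random_state)
        sample = [rng.choice(data) for _ in data]  # bootstrap-style resample
        self.estimate_ = sum(sample) / len(sample)
        return self

    def predict(self, n):
        # Prediction is deterministic once the model has been fitted.
        return [self.estimate_] * n


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


data = [i ** 0.5 for i in range(1000)]

# Seeding only the global RNG: the two fits still use independent entropy.
random.seed(0)
unseeded_a = ToyEstimator().fit(data).predict(5)
random.seed(0)
unseeded_b = ToyEstimator().fit(data).predict(5)

# Passing an integer random_state at construction: re-fits agree exactly.
seeded_a = ToyEstimator(random_state=0).fit(data).predict(5)
seeded_b = ToyEstimator(random_state=0).fit(data).predict(5)

# A pickle round-trip of a fitted model leaves its predictions unchanged.
restored = pickle.loads(pickle.dumps(ToyEstimator(random_state=0).fit(data)))

print("seeded re-fit max abs diff:", max_abs_diff(seeded_a, seeded_b))
print("pickle round-trip max abs diff:", max_abs_diff(seeded_a, restored.predict(5)))
print("unseeded re-fits identical:", unseeded_a == unseeded_b)
```

The same three comparisons (re-fit, round-trip, global seed only) are the shape of check one would run against a real estimator; the toy class only keeps the sketch runnable without third-party packages.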

What changed

Adds a Notes -> Reproducibility section to the CausalForestDML docstring, immediately before Examples, following the numpydoc section ordering convention. The note is anchored next to the existing random_state parameter documentation so users see both at once.

Scope question

The same reproducibility requirement applies to the other DML estimators (LinearDML, SparseLinearDML, NonParamDML) that thread random_state through _make_first_stage_selector. I kept this PR scoped to CausalForestDML since both open issues specifically reference it. Happy to add the same note to the other DML docstrings in this PR (or a follow-up) if you'd prefer broader coverage.

…, py-why#962)

Two open issues report nondeterministic CausalForestDML predictions across runs (py-why#962) and across save/load cycles (py-why#971). Both can be resolved by passing an integer random_state at construction; setting numpy.random.seed alone is not sufficient because the estimator uses its own seeded RNG for cross-fitting splits and forest subsampling.

Add a Reproducibility note under a new Notes section in the docstring so the requirement is discoverable next to the existing random_state parameter documentation, before the Examples block.

Signed-off-by: Imran Ahamed <immu4989@gmail.com>

This was referenced Jun 6, 2026

CausalForest DML Randomness #971

Open

Despite setting fixed parameters, yet the CausalForestDML gets different results each time it is run? Is there a problem with this? #962

Open

immu4989 wants to merge 1 commit into py-why:main from immu4989:docs-causalforestdml-reproducibility
