docs: document CausalForestDML reproducibility requirements (closes #971, #962) #1035
Open
immu4989 wants to merge 1 commit into
…, py-why#962)

Two open issues report nondeterministic CausalForestDML predictions across runs (py-why#962) and across save/load cycles (py-why#971). Both can be resolved by passing an integer random_state at construction; setting numpy.random.seed alone is not sufficient because the estimator uses its own seeded RNG for cross-fitting splits and forest subsampling.

Add a Reproducibility note under a new Notes section in the docstring so the requirement is discoverable next to the existing random_state parameter documentation, before the Examples block.

Signed-off-by: Imran Ahamed <immu4989@gmail.com>
This was referenced Jun 6, 2026
Closes #971 and #962.
Motivation
Two open user issues report nondeterministic `CausalForestDML` outputs: predictions that differ across runs (#962) and across save/load cycles (#971).

I verified on `main` that `CausalForestDML` is fully reproducible (across repeated `effect`/`effect_interval` calls, pickle round-trips, and independent re-fits) when an integer `random_state` is passed at construction. I tested 8 configurations (`n_jobs` ∈ {None, 1, -1, 4}, `'auto'` first-stage, discrete treatment, varied `cv`); all show `max|Δ| = 0`.

The two reporters' likely root causes:

- Calling `numpy.random.seed(...)` globally instead of passing `random_state=<int>` to the estimator. The estimator owns its own seeded RNG path and does not consult the numpy global seed.
- Leaving `random_state` unset (default `None`), in which case results are intentionally stochastic across fits.

Both are documentable as a "missing reproducibility requirement," not code bugs.
What changed
Adds a `Notes -> Reproducibility` section to the `CausalForestDML` docstring, immediately before `Examples`. This follows the numpy-doc section ordering convention. The note is anchored next to the existing `random_state` parameter documentation so users see both at once.

Scope question
The same reproducibility requirement applies to the other DML estimators (`LinearDML`, `SparseLinearDML`, `NonParamDML`) that thread `random_state` through `_make_first_stage_selector`. I kept this PR scoped to `CausalForestDML` since both open issues specifically reference it. Happy to add the same note to the other DML docstrings in this PR (or a follow-up) if you'd prefer broader coverage.