Skip to content

feat!: Gate pipeline deserialization through a module allowlist#11432

Open
bogdankostic wants to merge 6 commits into
v3from
yaml_deserialization
Open

feat!: Gate pipeline deserialization through a module allowlist#11432
bogdankostic wants to merge 6 commits into
v3from
yaml_deserialization

Conversation

@bogdankostic
Copy link
Copy Markdown
Contributor

Related Issues

Proposed Changes:

Pipeline.load / Pipeline.loads / Pipeline.from_dict used to dynamically import any class referenced in the YAML via importlib.import_module, which made a crafted YAML capable of causing arbitrary classes to be imported and instantiated (e.g. subprocess.Popen). This PR gates every import-by-name through an allowlist of trusted module namespaces.

Default allowlist: haystack, haystack_integrations, haystack_experimental, builtins, typing, collections.

Four ways to extend it, in increasing scope:

Scope API
Per call Pipeline.load(fp, allowed_modules=["mypkg.*"])
Per call (bypass) Pipeline.load(fp, unsafe=True)
Process-wide from haystack.core.serialization import allow_deserialization_module
Environment HAYSTACK_DESERIALIZATION_ALLOWLIST="mypkg.*,otherpkg.*"

The gate is wired into every string-to-class entry point:

  • import_class_by_name (used by default_from_dict for nested types)
  • deserialize_type (type annotations)
  • deserialize_callable (function references)
  • Pipeline.from_dict's component-type lookup

The per-call kwargs are propagated to all functions in the deserialization chain via a ContextVar (_DeserializationContext), so existing signatures don't change.

Defense in depth — parameter-name check: default_from_dict now refuses to recurse into nested {"type": "..."} dictionaries whose key is not an __init__ parameter of the parent class. This blocks YAML that smuggles classes into unused parameter slots — even classes on the allowlist can't be instantiated.

How did you test it?

New test file test/core/test_serialization_security.py (39 tests) covering:

  • Pattern-matching semantics for _module_matches
  • The default allowlist (haystack, typing, collections, builtins) and rejection of common attack-surface modules (subprocess, os).
  • All four extension mechanisms (per-call kwarg, process-wide, env var, unsafe=True bypass) — both that they enable the right modules and that they reset cleanly.
  • Propagation through Pipeline.load / loads / from_dict

Updated existing tests in test/core/test_serialization.py, test/core/pipeline/test_pipeline_base.py, four test_*_nonexisting_docstore tests in test/components/, and test/utils/test_callable_serialization.py to use modules that pass the allowlist where the original intent was to test the
import-failure path.

Test infrastructure (test/conftest.py): the autouse fixture extends the process-wide allowlist with test_*, *.test_*, test.*, pydantic, and httpx — i.e. only the modules existing tests legitimately reference.

Notes for the reviewer

  • MIGRATION.md has a copy-pasteable entry covering the four extension paths.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@vercel
Copy link
Copy Markdown

vercel Bot commented May 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
haystack-docs Ready Ready Preview, Comment Jun 2, 2026 2:07pm

Request Review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  haystack/core
  serialization.py 343-344
  serialization_security.py
  haystack/core/pipeline
  base.py
  haystack/utils
  callable_serialization.py
  type_serialization.py 195
Project Total  

This report was generated by python-coverage-comment-action

@bogdankostic bogdankostic marked this pull request as ready for review June 2, 2026 14:05
@bogdankostic bogdankostic requested review from a team as code owners June 2, 2026 14:05
@bogdankostic bogdankostic requested review from julian-risch and removed request for a team June 2, 2026 14:05
Copy link
Copy Markdown
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdankostic Thanks for opening this PR! I like that the release note and migration.md are detailed. I need some more time to look into the entire PR and I suggest we add another reviewer too.
Two things I want us to look more into.

A) custom from_dict implementations that do not route through the gated helpers
We need to make clear to users that they need to go through the gated helpers in their custom implementations. For example, we should then track a documentation change in https://github.com/deepset-ai/haystack-private/issues/381
Plus, we need to ensure integrations or other custom components on platform adhere to that new rule. Quick code search in integrations doesn't look too bad:

  1. google_vertex — VertexAITextGenerator.from_dict (/integrations/google_vertex/src/haystack_integrations/components/generators/google_vertex/text_generator.py#L108)
module_name, class_name = grounding_source["type"].rsplit(".", 1)
module = importlib.import_module(module_name)
data["init_parameters"]["grounding_source"] = getattr(module, class_name)(**grounding_source["init_parameters"])
  1. ragas — _deserialize_metric (/integrations/ragas/src/haystack_integrations/components/evaluators/ragas/utils.py#L53)
module_path, class_name = type_path.rsplit(".", 1)
metric_cls = getattr(importlib.import_module(module_path), class_name)
...
return metric_cls(**kwargs)

  1. dspy — DSPyChatGenerator._deserialize_signature (/integrations/dspy/src/haystack_integrations/components/generators/dspy/chat/chat_generator.py#L200)
module_path, class_name = value.rsplit(".", 1)
module = importlib.import_module(module_path)
return getattr(module, class_name)

B) builtins in the default allowlist
DEFAULT_ALLOWED_MODULES includes builtins (/haystack/core/serialization_security.py#L362)), which contains eval, exec, compile, import, and getattr.

If we keep builtins in the default allowlist then having a denylist with the callables (eval, exec, compile, import) and adding a test that asserts builtins.eval / subprocess.Popen or similar are blocked would be good.

@julian-risch julian-risch self-requested a review June 4, 2026 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants