feat!: Gate pipeline deserialization through a module allowlist #11432
bogdankostic wants to merge 6 commits into
julian-risch left a comment
@bogdankostic Thanks for opening this PR! I like that the release note and migration.md are detailed. I need some more time to look into the entire PR and I suggest we add another reviewer too.
There are two things I'd like us to look into more closely.
A) custom from_dict implementations that do not route through the gated helpers
We need to make clear to users that they need to go through the gated helpers in their custom implementations. For example, we should then track a documentation change in https://github.com/deepset-ai/haystack-private/issues/381
Plus, we need to ensure that integrations and other custom components on the platform adhere to that new rule. A quick code search in integrations doesn't look too bad:
- google_vertex: VertexAITextGenerator.from_dict (/integrations/google_vertex/src/haystack_integrations/components/generators/google_vertex/text_generator.py#L108)
      module_name, class_name = grounding_source["type"].rsplit(".", 1)
      module = importlib.import_module(module_name)
      data["init_parameters"]["grounding_source"] = getattr(module, class_name)(**grounding_source["init_parameters"])
- ragas: _deserialize_metric (/integrations/ragas/src/haystack_integrations/components/evaluators/ragas/utils.py#L53)
      module_path, class_name = type_path.rsplit(".", 1)
      metric_cls = getattr(importlib.import_module(module_path), class_name)
      ...
      return metric_cls(**kwargs)
- dspy: DSPyChatGenerator._deserialize_signature (/integrations/dspy/src/haystack_integrations/components/generators/dspy/chat/chat_generator.py#L200)
      module_path, class_name = value.rsplit(".", 1)
      module = importlib.import_module(module_path)
      return getattr(module, class_name)
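For illustration, the import-by-name pattern in the three snippets above could be routed through a single gate roughly like this. It is only a sketch under stated assumptions: `gated_import_class`, `ALLOWED_PATTERNS`, `from_dict_sketch` and the fnmatch-based matching rule are hypothetical stand-ins, not the gated helpers this PR actually exposes.

```python
import importlib
from fnmatch import fnmatchcase

# Hypothetical allowlist for this sketch; the real default lives in the PR.
ALLOWED_PATTERNS = ("collections", "collections.*")


def gated_import_class(type_path: str, allowed=ALLOWED_PATTERNS):
    """Resolve 'pkg.module.ClassName' only if the module matches the allowlist."""
    module_path, _, class_name = type_path.rpartition(".")
    if not module_path:
        raise ValueError(f"'{type_path}' is not a fully qualified name")
    if not any(fnmatchcase(module_path, pattern) for pattern in allowed):
        # Refuse before importing, so a blocked module is never imported.
        raise PermissionError(f"Module '{module_path}' is not on the allowlist")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def from_dict_sketch(data: dict):
    """Custom deserialization that goes through the gate instead of importlib directly."""
    nested = data["init_parameters"]["grounding_source"]
    cls = gated_import_class(nested["type"])
    return cls(**nested["init_parameters"])
```

With a gate of this shape, a nested `type` of `collections.OrderedDict` resolves, while `subprocess.Popen` raises before any import happens.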
B) builtins in the default allowlist
DEFAULT_ALLOWED_MODULES includes builtins (/haystack/core/serialization_security.py#L362), which contains eval, exec, compile, __import__, and getattr.
If we keep builtins in the default allowlist, it would be good to add a denylist for the dangerous callables (eval, exec, compile, __import__) and a test asserting that builtins.eval, subprocess.Popen, and similar are blocked.
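A rough sketch of what the denylist plus test could look like. The names here (`is_name_allowed`, `ALLOWED_MODULE_PATTERNS`, `DENIED_BUILTINS`) and the allowlist values are assumptions made for illustration; none of them are taken from the PR's code.

```python
from fnmatch import fnmatchcase

# Assumed values, for illustration only.
ALLOWED_MODULE_PATTERNS = ("haystack", "haystack.*", "builtins", "typing", "collections")
# Builtins that can execute or import arbitrary code.
DENIED_BUILTINS = frozenset({"eval", "exec", "compile", "__import__"})


def is_name_allowed(qualified_name: str) -> bool:
    """Return True if 'module.attr' passes the allowlist and is not a denied builtin."""
    module_name, _, attr = qualified_name.rpartition(".")
    if not any(fnmatchcase(module_name, pattern) for pattern in ALLOWED_MODULE_PATTERNS):
        return False
    if module_name == "builtins" and attr in DENIED_BUILTINS:
        return False
    return True


def test_dangerous_callables_are_blocked():
    for name in ("builtins.eval", "builtins.exec", "builtins.compile",
                 "builtins.__import__", "subprocess.Popen"):
        assert not is_name_allowed(name), name
    # Other builtins still pass the check.
    assert is_name_allowed("builtins.dict")
```

Whether getattr should also go on the denylist is left open here; the sketch only covers the four callables named above.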
Related Issues
Proposed Changes:
Pipeline.load / Pipeline.loads / Pipeline.from_dict used to dynamically import any class referenced in the YAML via importlib.import_module, which made a crafted YAML capable of causing arbitrary classes to be imported and instantiated (e.g. subprocess.Popen). This PR gates every import-by-name through an allowlist of trusted module namespaces.
Default allowlist: haystack, haystack_integrations, haystack_experimental, builtins, typing, collections.
Four ways to extend it, in increasing scope:
- Pipeline.load(fp, allowed_modules=["mypkg.*"])
- Pipeline.load(fp, unsafe=True)
- from haystack.core.serialization import allow_deserialization_module
- HAYSTACK_DESERIALIZATION_ALLOWLIST="mypkg.*,otherpkg.*"
The gate is wired into every string-to-class entry point:
- import_class_by_name (used by default_from_dict for nested types)
- deserialize_type (type annotations)
- deserialize_callable (function references)
- Pipeline.from_dict's component-type lookup
The per-call kwargs are propagated to all functions in the deserialization chain via a ContextVar (_DeserializationContext), so existing signatures don't change.
Defense in depth (parameter-name check):
default_from_dict now refuses to recurse into nested {"type": "..."} dictionaries whose key is not an __init__ parameter of the parent class. This blocks YAML that smuggles classes into unused parameter slots; even classes on the allowlist can't be instantiated.
How did you test it?
New test file test/core/test_serialization_security.py (39 tests) covering:
- _module_matches
- Acceptance of default-allowlist modules (haystack, typing, collections, builtins) and rejection of common attack-surface modules (subprocess, os).
- The extension paths (including the unsafe=True bypass): both that they enable the right modules and that they reset cleanly.
- Pipeline.load / loads / from_dict
Updated existing tests in test/core/test_serialization.py, test/core/pipeline/test_pipeline_base.py, four test_*_nonexisting_docstore tests in test/components/, and test/utils/test_callable_serialization.py to use modules that pass the allowlist where the original intent was to test the import-failure path.
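The "reset cleanly" property mentioned above is the kind of thing a ContextVar token makes straightforward to check. A minimal sketch, assuming hypothetical names (`_extra_allowed`, `allow_modules`, `current_allowlist`) rather than the PR's `_DeserializationContext`:

```python
from contextlib import contextmanager
from contextvars import ContextVar

DEFAULT_ALLOWED = ("haystack", "typing")  # assumed subset, for illustration

# Per-call additions; empty unless a caller opts in.
_extra_allowed: ContextVar[tuple] = ContextVar("_extra_allowed", default=())


@contextmanager
def allow_modules(*patterns: str):
    """Temporarily extend the allowlist for the duration of one call."""
    token = _extra_allowed.set(_extra_allowed.get() + patterns)
    try:
        yield
    finally:
        # Restore the previous value even if deserialization raises.
        _extra_allowed.reset(token)


def current_allowlist() -> tuple:
    """Default entries plus whatever the current context has added."""
    return DEFAULT_ALLOWED + _extra_allowed.get()


def test_per_call_allowlist_resets():
    assert "mypkg.*" not in current_allowlist()
    try:
        with allow_modules("mypkg.*"):
            assert "mypkg.*" in current_allowlist()
            raise RuntimeError("simulated deserialization failure")
    except RuntimeError:
        pass
    # The extension must not leak past the call that requested it.
    assert "mypkg.*" not in current_allowlist()
```

Resetting via the token in a `finally` block is what keeps a failed load from leaving the extended allowlist in place for later calls.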
Test infrastructure (test/conftest.py): the autouse fixture extends the process-wide allowlist with test_*, *.test_*, test.*, pydantic, and httpx, i.e. only the modules existing tests legitimately reference.
Notes for the reviewer
MIGRATION.md has a copy-pasteable entry covering the four extension paths.
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.