[Python] Add PCollection.with_side_outputs for chainable side outputs#38268
[Python] Add PCollection.with_side_outputs for chainable side outputs#38268hjtran wants to merge 4 commits intoapache:masterfrom
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the Apache Beam Python SDK by enabling PCollections to carry named side outputs while remaining chainable. This solves a long-standing usability issue where transform authors had to choose between returning a chainable PCollection or a container that preserves side outputs. The changes introduce a new API for attaching side outputs and ensure these are correctly registered within the pipeline graph, maintaining compatibility with existing runner APIs and composite transform structures. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
Rename _FilterWithSideOutputs/_SplitOddDoFn helpers in pipeline_test.py
to RemoveEvens/_RemoveEvensDoFn for clarity, and flip the DoFn semantics
so the name matches the behavior (evens go to the 'dropped' side output).
Drop two tests that were not pulling their weight:
- test_pcollection_side_outputs_wraps_with_outputs: fully subsumed by
test_pcollection_side_outputs_end_to_end, which already asserts the
applied transform's outputs contain both tags.
- test_pcollection_side_outputs_allows_idempotent_tag_collision: only
reachable by reaching past the public API to set _side_outputs
manually; the defensive impl branch is still worth keeping but the
test is not.
Merge the two pvalue_test.py validation tests into a single
test_with_side_outputs_validation.
Summary
Adds
PCollection.with_side_outputs(**kwargs)to the Python SDK so transforms can return a single, chainablePCollectionthat also carries named side-output PCollections, accessible viaresult.side_outputs.<tag>.Today, transform authors face a tradeoff between returning a
PCollection(chainable, but loses side outputs) or adict/PCollectionTuple(preserves side outputs, but breaks chaining). This change lets them have both.Design
_SideOutputsContainerexposes attribute (container.dropped) and item (container["dropped"]) access plus iteration /len/in.PCollection.with_side_outputs(**kwargs)returns acopy.copy(self)with the side-output mapping attached. The original is unchanged. Validation: each value must be aPCollection(TypeError) on the same pipeline (ValueError).Pipeline._apply_internalandPipeline._replaceboth call a shared helper that registers the side outputs on the wrappingAppliedPTransform— only for the top-level returnedPCollection. This makes side outputs first-class in the pipeline graph (visible to runners, the proto, visualization, and YAML'sTransform.tagresolution).ValueError.PCollectionobject (idempotent); otherwiseValueError.__copy__is added toPCollectionsocopy.copy(self)doesn't dispatch to the existing__reduce_ex__anti-pickling hook.Chaining behavior
pc | A | Bis unchanged.BreceivesA's main output as a normal PCollection.A's side outputs are not carried forward — capture the intermediate result if you need them. This matches the behavior described in the proposal and requires no special code.Caveats
**kwargssyntax restricts tags to valid Python identifiers. Tags with hyphens/spaces are not supported here; users with arbitrary string tags should keep using the existing dict /DoOutputsTuplepatterns..side_outputsis a construction-time ergonomic, not a durable graph property. Afterfrom_runner_api, the reconstructed PCollection won't have_side_outputspopulated, but the side outputs themselves still exist as named outputs on the producer'sAppliedPTransform.DoOutputsTuple).Follow-ups (not in this PR)
ParDo.with_side_outputs(...)convenience method.get_pcollectionalready resolvesTransform.tagby walking the producer's outputs dict, so this should mostly "just work" once verified).Tests
pvalue_test.pycovers: copy semantics, attribute / index access, missing-tag errors, type/pipeline validation, empty container, double-call replace.pipeline_test.pycovers: end-to-end materialization withassert_that, the canonicalMyFilter-style composite wrappingwith_outputs, foreign-pcollection rejection, tag-collision rejection, idempotent same-object collision, replacement-path registration viaPipeline.replace_all, runner API round-trip, and the nested-return non-flattening case.Test plan
cd sdks/python && python -m pytest apache_beam/pvalue_test.py apache_beam/pipeline_test.py— 84 passed, 1 skippedruff check --config=sdks/python/ruff.tomlon the four changed files — cleanyapf --diffon the four changed files — clean