[CRCR] Initial implementation of L2 #7967
|
Hi @KarhouTam! Thank you for your pull request and welcome to our community. Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
@KarhouTam is attempting to deploy a commit to the Meta Open Source Team on Vercel. A member of the Team first needs to authorize it. |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
ZainRizvi
left a comment
Some earlier comments plus a couple new ones from cross-referencing with the RFC.
ZainRizvi
left a comment
A couple more things from cross-referencing with the RFC.
|
Hey, @ZainRizvi. Thanks for your valuable comments and suggestions. I am on Labor Day vacation now and will be back on May 6th. I will address these when I come back! Once again, thank you! |
ZainRizvi
left a comment
As a general prompting tip, Claude does really well if you point it at both the RFC PR and the Google doc we worked on and ask it to look for inconsistencies between them (and explicitly ask it to check the comments as well).
That's a good prompt for both the coding agent to start with and also for the reviewing agent to verify against.
Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com> Co-authored-by: fffrog <ljw1101.vip@gmail.com>
b6ea827 to e43fb53
TL;DR: What this commit does:
Key Changes
1. Unified State Machine (Primary Change)
Implementation:

    # Unified key pattern
    oot:state:{delivery_id}:{repo}:{check_run_id} = {"state": "...", "timestamp": 1234.56}

Key benefits:
Note that the direction graph below is for a single check run, reruns have different
State transitions:

    stateDiagram-v2
        direction LR
        [*] --> DISPATCHED: webhook sends
        DISPATCHED --> IN_PROGRESS: first callback
        IN_PROGRESS --> COMPLETED: completion
        IN_PROGRESS --> IN_PROGRESS: ❌ duplicate
        DISPATCHED --> COMPLETED: ❌ skip IN_PROGRESS
        COMPLETED --> COMPLETED: ❌ duplicate
        COMPLETED --> IN_PROGRESS: ❌ wrong direction
        [*] --> IN_PROGRESS: ❌ no dispatch
        [*] --> COMPLETED: ❌ no dispatch

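The accepted and rejected edges in the diagram above can be read as a small lookup table. The following is only a sketch under that reading, not the code in this PR: the names `State`, `ALLOWED_TRANSITIONS`, and `is_valid_transition` are hypothetical, and `None` is used here to stand for "no state stored yet" (the `[*]` node).

```python
from enum import Enum
from typing import Optional


class State(str, Enum):
    """The three states shown in the diagram (names taken from the diagram)."""

    DISPATCHED = "DISPATCHED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"


# Hypothetical table: only the three forward edges of the diagram are accepted.
# None represents the [*] node, i.e. no state has been stored for this check run.
ALLOWED_TRANSITIONS = {
    (None, State.DISPATCHED),
    (State.DISPATCHED, State.IN_PROGRESS),
    (State.IN_PROGRESS, State.COMPLETED),
}


def is_valid_transition(current: Optional[State], new: State) -> bool:
    """Return True only for edges the diagram marks as valid."""
    return (current, new) in ALLOWED_TRANSITIONS
```

With this table, every edge marked ❌ in the diagram (duplicates, skipping `IN_PROGRESS`, going backwards, or a callback with no prior dispatch) falls through to `False`.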
Timing Metrics:
State types:
Valid flows:
Rejected:
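To make the unified key pattern and the `{state, timestamp}` value concrete, here is a minimal sketch. The helper names (`state_key`, `encode_state`, `elapsed_seconds`) are hypothetical, and deriving a duration by subtracting two stored timestamps is an assumption about how the timing metrics might be computed, not something this thread confirms.

```python
import json
import time
from typing import Optional


def state_key(delivery_id: str, repo: str, check_run_id: int) -> str:
    """Build a key following the unified pattern quoted in this comment."""
    return f"oot:state:{delivery_id}:{repo}:{check_run_id}"


def encode_state(state: str, timestamp: Optional[float] = None) -> str:
    """Serialize the single JSON structure {state, timestamp}."""
    ts = time.time() if timestamp is None else timestamp
    return json.dumps({"state": state, "timestamp": ts})


def elapsed_seconds(earlier_record: str, later_record: str) -> float:
    """Assumed timing helper: difference between two stored timestamps."""
    earlier = json.loads(earlier_record)["timestamp"]
    later = json.loads(later_record)["timestamp"]
    return later - earlier
```

Because each `check_run_id` gets its own key, every job carries its own timestamps, which is consistent with the per-job timing described in this comment.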
2. Repo Level Tracking
Added

    trusted = {
        "ci_metrics": {...},
        "verified_repo": "...",
        "downstream_repo_level": "L2"  # NEW: from allowlist
    }

3. Error Handling & Messages
Behavior:
Example: |
- Unified state machine: Single JSON structure `{state, timestamp}`
- Per-job timestamps: Each job has independent timing
- Repo level tracking: Added `downstream_repo_level` from allowlist (L1-L4)
- Error handling: Honest messages, 5xx → green CI, 4xx → red CI
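The last bullet (5xx → green CI, 4xx → red CI) can be sketched as a single predicate. This is one reading of the bullet, assumed rather than confirmed by the thread: a relay-side failure (5xx) does not fail the downstream job, while a rejected request (4xx) does. The function name `callback_step_should_fail` is hypothetical.

```python
def callback_step_should_fail(status_code: int) -> bool:
    """Return True when the downstream callback step should turn red.

    Assumed reading of the bullet: only 4xx responses fail the job;
    2xx and 5xx responses leave the downstream CI green.
    """
    return 400 <= status_code < 500
```

Under this reading, `callback_step_should_fail(403)` is `True`, while `callback_step_should_fail(502)` and `callback_step_should_fail(200)` are both `False`.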
|
Hey, @huydhn, this PR is ready for review. Please take a look and we are looking forward to your feedback! |
- Refactor Makefile: separate directory creation for clarity in deployment process
- Enhance Cross-Repo CI Relay Callback: handle edge case for CHECK_RUN_ID and improve repo level verification in result handler
- Improve error handling in Cross-Repo CI Relay Callback action
Promote the canonical CRCR test repo from L1 to L2 so downstream CI results are reported to HUD. This supports end-to-end testing of the L2 relay (pytorch/test-infra#7967) and OOT HUD pipeline (pytorch/test-infra#8069). Pull Request resolved: #184482 Approved by: https://github.com/atalman
Please fix lint: https://github.com/pytorch/test-infra/actions/runs/26208735503/job/77170460585?pr=7967
cc @zxiiro for review as well before merging
## Summary

This PR is an extension of L1 (pytorch#433), and it only adds another AWS Lambda function and some environment variables for handling result callbacks from DownStream CI. It is also associated with L2 implementation (pytorch/test-infra#7967) and should **only** be merged after L2 implementation is completed.

## File to change

``` text
.github/workflows/
├── crcr-on-pr.yml            # Add result callback environment variables
└── crcr-deploy-prod.yml      # Add result callback environment variables
crcr/
├── Terrafile                 # Add result callback zip file download from test-infra
└── aws/
    ├── variables.tf          # Add result callback environment variables
    ├── outputs.tf            # Add result callback AWS Lambda function
    ├── secrets.tf            # Add result callback secrets
    ├── security.tf           # Add security configuration
    └── callback.tf           # Add result callback AWS Lambda function configuration
```

## Test

Multiple deployments and verifications have been completed on a personal AWS environment.

cc @fffrog

---------

Co-authored-by: Thanh Ha <thanh.ha@linuxfoundation.org>
Co-authored-by: Thanh Ha <zxiiro@gmail.com>
Author
Summary
Higher-level behaviors for `L3` and `L4` are intentionally left for follow-up work.
Architecture
The relay is split into two AWS Lambda functions:
- `webhook` lambda function (Updated): `opened`/`reopened`/`synchronized`/`closed` actions, forwards repository_dispatch events to downstream repos
- `callback` lambda function (Added): `queue time` and `execute time` for evolution to `L3` repo
Changes

    .github/
    ├── workflows/
    │   └── _lambda-do-release-runners.yml  # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release
    │
    └── actions/
        └── cross-repo-ci-relay-callback/
            └── action.yml                  # Composite action used by downstream workflows to report status back to the relay/result endpoint

    aws/lambda/cross_repo_ci_relay/
    ├── tests/                    # Unit tests for allowlist/config/webhook/result/redis behavior
    ├── README.md                 # Project overview, local development, callback flow, and result-side validation steps
    ├── Makefile                  # Top-level local developer entrypoint for test / deploy / clean
    ├── local_server.py           # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints
    ├── requirements.txt          # Python dependencies required by the relay Lambdas
    │
    ├── utils/
    │   ├── allowlist.py          # Loads, parses, and queries the downstream allowlist by rollout level
    │   ├── config.py             # Shared runtime config loading and cached get_config() helper
    │   ├── gh_helper.py          # GitHub App, repository_dispatch, and GitHub file access helpers
    │   ├── hud.py                # HUD write helpers for downstream result reporting
    │   ├── jwt_helper.py         # Helpers for minting/verifying relay callback tokens
    │   ├── redis_helper.py       # Redis helpers for allowlist cache, OOT state, and timing data
    │   └── misc.py               # Shared TypedDict definitions and HTTPException
    │
    ├── webhook/
    │   ├── Makefile              # Build/package/deploy commands for the webhook Lambda
    │   ├── lambda_function.py    # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events
    │   └── event_handler.py      # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos
    │
    └── callback/
        ├── Makefile              # Build/package/deploy commands for the result Lambda
        ├── lambda_function.py    # Result Lambda entrypoint: verifies callback token and GitHub OIDC token
        └── callback_handler.py   # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD

Usage
See README.md for more details.
Verification
We performed the following scenario verification on our AWS Lambda instance:
Terraform configuration
Unit Tests
Security
Callback payload carries full upstream webhook data back to HUD: `action.yml` builds the callback body by mutating `github.event.client_payload` (which contains the entire original webhook payload: PR metadata, commits, author info) and adding `status`/`conclusion`/`workflow_name`/`workflow_url` on top. This full blob is forwarded verbatim by `hud.py` to HUD with no relay-side filtering. HUD receives both the relay-trusted `verified_repo` and an unvalidated body; if HUD trusts self-reported fields inside the body over `verified_repo`, a manipulated dispatch payload could tamper with HUD records.
Lambda callback URL is public and hardcoded: The endpoint is hardcoded in `action.yml` and exposed in a public action, making it trivially discoverable. OIDC verification blocks unauthorized HUD writes, but the endpoint has no rate limiting; request flooding can cause Lambda concurrency exhaustion or Redis connection saturation.
Only OIDC is used for verification: The callback lambda relies solely on GitHub OIDC token verification for authentication, without additional application-level secrets or signatures. If an attacker compromises a downstream repo's GitHub Actions permissions, they could forge authenticated requests to the callback endpoint. In addition, OIDC has its own limitations (e.g., token expiration, potential misconfigurations) that could lead to unauthorized access if not carefully managed.
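To illustrate the first item above, here is a rough Python sketch of how a callback body is assembled from the dispatch `client_payload` plus the four status fields. The real logic lives in `action.yml` and is not shown in this thread, so `build_callback_body` is a hypothetical name, and this sketch copies the payload instead of mutating it in place.

```python
import copy
from typing import Any, Dict, Optional


def build_callback_body(
    client_payload: Dict[str, Any],
    status: str,
    conclusion: Optional[str],
    workflow_name: str,
    workflow_url: str,
) -> Dict[str, Any]:
    """Layer the four status fields on top of the original dispatch payload."""
    # Deep copy so the caller's payload is left untouched (a choice made for
    # this sketch; the description above says the action mutates the payload).
    body = copy.deepcopy(client_payload)
    body.update(
        {
            "status": status,
            "conclusion": conclusion,
            "workflow_name": workflow_name,
            "workflow_url": workflow_url,
        }
    )
    return body
```

Everything already present in `client_payload` survives into the body unchanged, which is why the whole upstream webhook payload reaches HUD unless something filters it along the way.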
HUD Interaction
Design Principle: Transparent Relay & Decoupling
The Relay Server acts as a lightweight data passthrough layer. It does not define or parse specific CI data formats; instead, it offloads data interpretation and validation to the HUD. This ensures complete decoupling between the relay infrastructure and business-specific data.
Security & Risk Mitigation
The relay uses OIDC authentication to guarantee the authenticity of the data source (Verified Repo). Its core responsibility is to ensure the data originates from the claimed repository, while security filtering and content compliance are enforced at the HUD level.
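On the consuming side, the split of responsibilities above suggests that a HUD-like consumer should treat `verified_repo` as authoritative and reject bodies that disagree with it. The sketch below assumes a self-reported `repository` field in the body; that field name and the function `resolve_repo` are hypothetical, and this is not HUD's actual code.

```python
from typing import Any, Dict


def resolve_repo(verified_repo: str, body: Dict[str, Any]) -> str:
    """Return the repo to record, always preferring the relay-verified value.

    `repository` is a hypothetical self-reported field used for illustration.
    """
    claimed = body.get("repository")
    if claimed is not None and claimed != verified_repo:
        # The body claims a different repo than the OIDC-verified one.
        raise ValueError(
            f"self-reported repo {claimed!r} does not match verified repo {verified_repo!r}"
        )
    return verified_repo
```

The point of the design is that the body is never used to decide which repository a result belongs to; at most it is cross-checked against the verified value.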