[CRCR] Initial implementation of L2 by KarhouTam · Pull Request #7967 · pytorch/test-infra

KarhouTam · 2026-04-14T14:46:20Z

Summary

This PR implements the L2 level of the cross-repository CI relay described in [RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends rfcs#90.
For the previous L1 implementation, see Implement initial L1 cross-repo CI relay #7847.
For the overall implementation, see [RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends rfcs#90 (comment).
For the design of the HUD side, see RFC-0054: HUD Integration for Out-of-Tree CI Results rfcs#96.
For the implementation of the HUD side, see OOT HUD: Full ingestion pipeline — API, schemas, queries, and frontend pages #8069.

Higher-level behaviors for L3 and L4 are intentionally left for follow-up work.

Architecture

The relay is split into two AWS Lambda functions:

webhook lambda function (Updated)
- receives GitHub webhook PR and push events from the upstream repo
- validates webhook signatures and authenticates with AWS Secrets Manager
- reads the downstream allowlist from the URL and stores it in Redis
- for opened/reopened/synchronize/closed actions, forwards repository_dispatch events to downstream repos

callback lambda function (Added)
- receives downstream callback payloads through a public Lambda function URL
- validates the callback payload with OIDC
- reads the downstream allowlist from the URL and stores it in Redis
- extracts CI result information from the payload and uploads it to PyTorch HUD
- records queue time and execution time for a repo's future evolution to L3
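The PR description does not show how the webhook Lambda validates signatures. As a minimal sketch, assuming the standard GitHub scheme (an HMAC-SHA256 of the raw request body, sent in the X-Hub-Signature-256 header), the check could look like the following. The function name and the way the secret is passed in are hypothetical, not taken from lambda_function.py.

```python
import hashlib
import hmac


def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Return True if the X-Hub-Signature-256 header matches the raw request body.

    Hypothetical helper: the secret is assumed to have been fetched elsewhere
    (for example from AWS Secrets Manager).
    """
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )
```

The comparison is done on the raw bytes of the body, so the signature has to be checked before the JSON is parsed or re-serialized.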

Changes

.github/
├── workflows/
│   └── _lambda-do-release-runners.yml     # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release
│
└── actions/
    └── cross-repo-ci-relay-callback/
        └── action.yml                     # Composite action used by downstream workflows to report status back to the relay/result endpoint

aws/lambda/cross_repo_ci_relay/
├── tests/                                 # Unit tests for allowlist/config/webhook/result/redis behavior
├── README.md                              # Project overview, local development, callback flow, and result-side validation steps
├── Makefile                               # Top-level local developer entrypoint for test / deploy / clean
├── local_server.py                        # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints
├── requirements.txt                       # Python dependencies required by the relay Lambdas
│
├── utils/
│   ├── allowlist.py                       # Loads, parses, and queries the downstream allowlist by rollout level
│   ├── config.py                          # Shared runtime config loading and cached get_config() helper
│   ├── gh_helper.py                       # GitHub App, repository_dispatch, and GitHub file access helpers
│   ├── hud.py                             # HUD write helpers for downstream result reporting
│   ├── jwt_helper.py                      # Helpers for minting/verifying relay callback tokens
│   ├── redis_helper.py                    # Redis helpers for allowlist cache, OOT state, and timing data
│   └── misc.py                            # Shared TypedDict definitions and HTTPException
│
├── webhook/
│   ├── Makefile                           # Build/package/deploy commands for the webhook Lambda
│   ├── lambda_function.py                 # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events
│   └── event_handler.py                   # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos
│
└── callback/
    ├── Makefile                           # Build/package/deploy commands for the result Lambda
    ├── lambda_function.py                 # Result Lambda entrypoint: verifies callback token and GitHub OIDC token
    └── callback_handler.py                # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD

Usage

See README.md for more details.

Verification

We performed the following scenario verification on our AWS Lambda instance:

Tested upstream PR create/reopen/synchronize and push events triggering the webhook, which then redispatches them to the downstream CI workflow (in a different organization).
Tested a downstream workflow sending the callback payload through the added action to the result lambda, which then extracts the CI result information and sends it to PyTorch HUD.

Terraform configuration

[WIP] Add L2 CRCR deployment configuration ci-infra#446

Unit Tests

Unit Tests (Mock)

Security

Callback payload carries full upstream webhook data back to HUD: action.yml builds the callback body by mutating github.event.client_payload (which contains the entire original webhook payload: PR metadata, commits, author info) and adding status/conclusion/workflow_name/workflow_url on top. This full blob is forwarded verbatim by hud.py to HUD with no relay-side filtering. HUD receives both the relay-trusted verified_repo and an unvalidated body; if HUD trusts self-reported fields inside the body over verified_repo, a manipulated dispatch payload could tamper with HUD records.
Lambda callback URL is public and hardcoded: the endpoint is hardcoded in action.yml and exposed in a public action, making it trivially discoverable. OIDC verification blocks unauthorized HUD writes, but the endpoint has no rate limiting; request flooding can cause Lambda concurrency exhaustion or Redis connection saturation.
Only OIDC is used for verification: the callback lambda relies solely on GitHub OIDC token verification for authentication, without additional application-level secrets or signatures. If an attacker compromises a downstream repo's GitHub Actions permissions, they could forge authenticated requests to the callback endpoint. In addition, OIDC has its own limitations (e.g., token expiration, potential misconfigurations) that could lead to unauthorized access if not carefully managed.

HUD Interaction

Design Principle: Transparent Relay & Decoupling
The Relay Server acts as a lightweight data passthrough layer. It does not define or parse specific CI data formats; instead, it offloads data interpretation and validation to the HUD. This ensures complete decoupling between the relay infrastructure and business-specific data.
Security & Risk Mitigation
The relay uses OIDC authentication to guarantee the authenticity of the data source (Verified Repo). Its core responsibility is to ensure the data originates from the claimed repository, while security filtering and content compliance are enforced at the HUD level.
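To make the "Verified Repo" guarantee concrete, here is a minimal sketch of the claim check that could follow signature verification of a GitHub Actions OIDC token. It assumes the token has already been verified and decoded into a dict elsewhere; the function name and the flat set of allowed repos are hypothetical. The issuer URL and the repository claim are the ones GitHub documents for Actions OIDC tokens.

```python
from typing import AbstractSet

# Issuer that GitHub Actions uses for its OIDC tokens.
GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"


def verified_repo_from_claims(claims: dict, allowed_repos: AbstractSet[str]) -> str:
    """Return the repository the token was issued for, or raise PermissionError.

    Assumes `claims` comes from a token whose signature was already verified.
    """
    if claims.get("iss") != GITHUB_OIDC_ISSUER:
        raise PermissionError("unexpected token issuer")
    repo = claims.get("repository")
    if not isinstance(repo, str) or repo not in allowed_repos:
        raise PermissionError("repository is not on the allowlist")
    # Only this value is trusted; fields inside the callback body are not.
    return repo
```

The returned value is the only repo identity the relay vouches for, which is why the body is treated as unvalidated data that HUD must check.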

ZainRizvi

Some earlier comments plus a couple new ones from cross-referencing with the RFC.

ZainRizvi

A couple more things from cross-referencing with the RFC.

KarhouTam · 2026-04-30T23:25:56Z

Hey, @ZainRizvi . Thanks for your valuable comments and suggestions. I am on Labor Day vacation now and will be back on May 6th. I will address these when I come back! Once again, thank you!

ZainRizvi

As a general prompting tip, claude does really well if you point it at both the RFC PR and the google doc we worked on and ask it to look for inconsistencies between them (and explicitly ask it to check the comments as well).

That's a good prompt for both the coding agent to start with and also for the reviewing agent to verify against.

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com> Co-authored-by: fffrog <ljw1101.vip@gmail.com>

KarhouTam · 2026-05-09T06:15:49Z

TL;DR

What this commit does:

Unified state machine: Single JSON structure {state, timestamp}
Per-job timestamps: Each job has independent timing
Repo level tracking: Added downstream_repo_level from allowlist (L1-L4)
Error handling: Honest messages, 5xx → green CI, 4xx → red CI

Key Changes

1. Unified State Machine (Primary Change)

Implementation:

# Unified key pattern
oot:state:{delivery_id}:{repo}:{check_run_id} = {"state": "...", "timestamp": 1234.56}

Key benefits:

Per-job timestamps (independent tracking for each job)
Validation (reject invalid transitions, duplicates)

Note that the state diagram below describes a single check run. A rerun has a different check_run_id and is treated as a separate job, so it does not violate the state machine: it has no prior IN_PROGRESS or COMPLETED record.

State transitions:

stateDiagram-v2
    direction LR

    [*] --> DISPATCHED: webhook sends
    DISPATCHED --> IN_PROGRESS: first callback
    IN_PROGRESS --> COMPLETED: completion

    IN_PROGRESS --> IN_PROGRESS: ❌ duplicate
    DISPATCHED --> COMPLETED: ❌ skip IN_PROGRESS
    COMPLETED --> COMPLETED: ❌ duplicate
    COMPLETED --> IN_PROGRESS: ❌ wrong direction
    [*] --> IN_PROGRESS: ❌ no dispatch
    [*] --> COMPLETED: ❌ no dispatch
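As a minimal sketch, the diagram for a single check run can be encoded as a set of allowed transitions. The names below are hypothetical and not taken from callback_handler.py; None stands for the [*] start node, and every pair not listed is rejected.

```python
from typing import Optional

# Transitions the diagram marks as valid; everything else is rejected.
ALLOWED_TRANSITIONS = {
    (None, "DISPATCHED"),           # webhook sends
    ("DISPATCHED", "IN_PROGRESS"),  # first callback
    ("IN_PROGRESS", "COMPLETED"),   # completion
}


def is_valid_transition(current: Optional[str], new: str) -> bool:
    """Check one step of the state machine for a single check run."""
    return (current, new) in ALLOWED_TRANSITIONS


def apply_transition(current: Optional[str], new: str, timestamp: float) -> dict:
    """Return the new {state, timestamp} record, or raise on an invalid step."""
    if not is_valid_transition(current, new):
        raise ValueError(f"invalid transition: {current} -> {new}")
    return {"state": new, "timestamp": timestamp}
```

Using an explicit allow-set keeps duplicates, skipped states, and wrong-direction updates on the same rejection path.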

Timing Metrics:

queue_time = dispatch_timestamp → in_progress_timestamp
execution_time = in_progress_timestamp → completed_timestamp
Timestamps from state records, not separate timing keys
Reruns are treated as new jobs and do not update the previous run's timestamps
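The two metrics can be derived directly from the stored state records. The sketch below assumes each record is the JSON string {state, timestamp} described earlier and that timestamps are in seconds; the function name is hypothetical.

```python
import json


def timing_metrics(dispatched: str, in_progress: str, completed: str) -> dict:
    """Compute queue and execution time from three JSON state records."""
    d, p, c = (json.loads(record) for record in (dispatched, in_progress, completed))
    return {
        # Time spent waiting between dispatch and the first callback.
        "queue_time": p["timestamp"] - d["timestamp"],
        # Time between the first callback and completion.
        "execution_time": c["timestamp"] - p["timestamp"],
    }
```

Because the timestamps live on the state records themselves, no separate timing keys need to be read.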

State types:

DISPATCHED: Repo-level state (check_run_id=DISPATCH_CHECK_RUN_ID) - one per repo
IN_PROGRESS / COMPLETED: Job-level states - independent per job

Valid flows:

Main: DISPATCHED → IN_PROGRESS → COMPLETED (straight line)
Reruns: a new IN_PROGRESS may arrive after an earlier IN_PROGRESS or COMPLETED, but only under a new check_run_id; it is tracked as a separate job with its own timestamps

Rejected:

Skip IN_PROGRESS, duplicate COMPLETED, no DISPATCHED state
Replay attack: duplicate IN_PROGRESS (same check_run_id)

2. Repo Level Tracking

Added a downstream_repo_level field to the trusted dict. The relay determines the level once from the allowlist (L1-L4) and HUD does not recompute it, which avoids the two drifting out of sync if the tiering changes.

trusted = {
    "ci_metrics": {...},
    "verified_repo": "...",
    "downstream_repo_level": "L2"  # NEW: from allowlist
}
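A minimal sketch of how the relay might fill in this field, assuming the allowlist is available as a mapping from level name to a list of repos. That shape and the function name are assumptions made for illustration; the real allowlist format is defined in allowlist.py.

```python
from typing import Dict, List


def build_trusted(
    verified_repo: str, ci_metrics: dict, allowlist: Dict[str, List[str]]
) -> dict:
    """Build the trusted dict, resolving the repo level from the allowlist."""
    for level, repos in allowlist.items():
        if verified_repo in repos:
            return {
                "ci_metrics": ci_metrics,
                "verified_repo": verified_repo,
                "downstream_repo_level": level,
            }
    raise LookupError(f"{verified_repo} is not on the allowlist")
```

Resolving the level at relay time means HUD stores the tier that applied when the result was reported, even if the allowlist changes later.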

3. Error Handling & Messages

Behavior:

HUD 5xx/network: caught, logged, return success (HUD outage shouldn't fail downstream CI)
HUD 4xx: propagate to caller (author must fix payload)
Honest messages: explain what happened + what to do

Example:

"An internal failure occurred. Your update was not saved, but the CI run is still valid. 
You can attempt progressive retries after X seconds or ignore this failure."
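The behavior described in this section can be sketched as a small mapping from the HUD response to what the relay returns to the downstream action. The function name, the return shape, and the message texts are hypothetical; None stands for a network error where no HUD status was received.

```python
from typing import Optional, Tuple


def callback_outcome(hud_status: Optional[int]) -> Tuple[int, str]:
    """Map the HUD write result to the (status, message) the relay returns."""
    if hud_status is None or hud_status >= 500:
        # HUD outage or network failure: log it, but keep downstream CI green.
        return 200, "HUD is unavailable; the update was not saved, the CI run is still valid."
    if 400 <= hud_status < 500:
        # The payload was rejected: surface it so the author can fix it.
        return hud_status, "HUD rejected the payload; fix the callback payload and retry."
    return 200, "Result recorded."
```

Only 4xx responses are propagated, since those are the cases the downstream author can act on.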

cc @ZainRizvi @fffrog @can-gaa-hou @jewelkm89 @subinz1

KarhouTam · 2026-05-12T12:00:57Z

Hey, @huydhn, this PR is ready for review. Please take a look and we are looking forward to your feedback!

cc @fffrog @can-gaa-hou @jewelkm89 @subinz1

- Refactor Makefile: separate directory creation for clarity in deployment process
- Enhance Cross-Repo CI Relay Callback: handle edge case for CHECK_RUN_ID and improve repo level verification in result handler
- Improve error handling in Cross-Repo CI Relay Callback action

Promote the canonical CRCR test repo from L1 to L2 so downstream CI results are reported to HUD. This supports end-to-end testing of the L2 relay (pytorch/test-infra#7967) and OOT HUD pipeline (pytorch/test-infra#8069). Pull Request resolved: #184482 Approved by: https://github.com/atalman

atalman

Please fix lint: https://github.com/pytorch/test-infra/actions/runs/26208735503/job/77170460585?pr=7967

cc @zxiiro for review as well before merging

- Unify the name from "result*" to "callback*"
- Fix lints

@fffrog

Summary

This PR is an extension of L1 (pytorch#433), and it only adds another AWS Lambda function and some environment variables for handling result callbacks from DownStream CI. It is also associated with L2 implementation (pytorch/test-infra#7967) and should only be merged after L2 implementation is completed.

File to change

.github/workflows/
├── crcr-on-pr.yml                         # Add result callback environment variables
└── crcr-deploy-prod.yml                   # Add result callback environment variables

crcr/
├── Terrafile                              # Add result callback zip file download from test-infra
└── aws/
    ├── variables.tf                       # Add result callback environment variables
    ├── outputs.tf                         # Add result callback AWS Lambda function
    ├── secrets.tf                         # Add result callback secrets
    ├── security.tf                        # Add security configuration
    └── callback.tf                        # Add result callback AWS Lambda function configuration

Test

Multiple deployments and verifications have been completed on a personal AWS environment.

cc @fffrog

Co-authored-by: Thanh Ha <thanh.ha@linuxfoundation.org>
Co-authored-by: Thanh Ha <zxiiro@gmail.com>

KarhouTam changed the title ~~[CRCR] Initial implementation of L2~~ [WIP][CRCR] Initial implementation of L2 Apr 14, 2026

meta-cla Bot added the CLA Signed label Apr 15, 2026

can-gaa-hou mentioned this pull request Apr 15, 2026

[WIP] Add L2 CRCR deployment configuration pytorch/ci-infra#446

Closed

KarhouTam force-pushed the crcr-L2 branch from 15984c8 to a202024 April 16, 2026 07:56

subinz1 reviewed Apr 23, 2026

Comment thread aws/lambda/cross_repo_ci_relay/utils/hud.py Outdated

subinz1 reviewed Apr 23, 2026

Comment thread aws/lambda/cross_repo_ci_relay/README.md

subinz1 mentioned this pull request Apr 24, 2026

[WIP][OOT HUD] Full pipeline: API endpoint, ClickHouse schema, replicator mapping, and frontend pages subinz1/test-infra#1

Draft

9 tasks

jewelkm89 reviewed Apr 25, 2026

Comment thread aws/lambda/cross_repo_ci_relay/utils/hud.py Outdated

subinz1 reviewed Apr 25, 2026

Comment thread aws/lambda/cross_repo_ci_relay/callback/result_handler.py Outdated

subinz1 reviewed Apr 25, 2026

Comment thread aws/lambda/cross_repo_ci_relay/callback/result_handler.py Outdated

subinz1 reviewed Apr 25, 2026

Comment thread .github/actions/cross-repo-ci-relay-callback/action.yml

ZainRizvi reviewed Apr 28, 2026

Comment thread aws/lambda/cross_repo_ci_relay/README.md Outdated

ZainRizvi reviewed Apr 30, 2026

Comment thread aws/lambda/cross_repo_ci_relay/README.md Outdated

Comment thread aws/lambda/cross_repo_ci_relay/utils/redis_helper.py Outdated

ZainRizvi reviewed May 4, 2026

Comment thread aws/lambda/cross_repo_ci_relay/utils/hud.py

Comment thread .github/actions/cross-repo-ci-relay-callback/action.yml

Comment thread aws/lambda/cross_repo_ci_relay/callback/result_handler.py Outdated

ZainRizvi reviewed May 4, 2026

CRCR L2 implementation

b98e042

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com> Co-authored-by: fffrog <ljw1101.vip@gmail.com>

KarhouTam force-pushed the crcr-L2 branch 2 times, most recently from b6ea827 to e43fb53 May 9, 2026 06:13

KarhouTam mentioned this pull request May 9, 2026

RFC-0054: HUD Integration for Out-of-Tree CI Results pytorch/rfcs#96

Open

KarhouTam force-pushed the crcr-L2 branch from e43fb53 to a810164 May 9, 2026 07:03

subinz1 mentioned this pull request May 12, 2026

OOT HUD: Full ingestion pipeline — API, schemas, queries, and frontend pages #8069

Closed

9 tasks

KarhouTam changed the title ~~[WIP][CRCR] Initial implementation of L2~~ [CRCR] Initial implementation of L2 May 12, 2026

KarhouTam marked this pull request as ready for review May 12, 2026 11:56

KarhouTam force-pushed the crcr-L2 branch from e895875 to c2833bd May 12, 2026 11:59

KarhouTam force-pushed the crcr-L2 branch from c2833bd to 820563d May 13, 2026 01:14

atalman self-requested a review May 14, 2026 15:32

atalman reviewed May 14, 2026

Comment thread aws/lambda/cross_repo_ci_relay/utils/jwt_helper.py Outdated

Remove audience setting in OIDC token generation

6519fe2

atalman reviewed May 19, 2026

Comment thread aws/lambda/cross_repo_ci_relay/utils/jwt_helper.py Outdated

Fix comments 0519

c504322

atalman reviewed May 19, 2026

Comment thread aws/lambda/cross_repo_ci_relay/tests/test_jwt_helper.py

atalman reviewed May 19, 2026

Comment thread aws/lambda/cross_repo_ci_relay/tests/test_jwt_helper.py