
[CRCR] Implement CRCR upstream check run management for L3/L4 jobs #8119

Open
can-gaa-hou wants to merge 4 commits into pytorch:main from can-gaa-hou:crcr-L3


@can-gaa-hou can-gaa-hou commented May 27, 2026

Contributor

Summary

Architecture

  • webhook function:
    • Create a PR label handling function for L3 repos (see the three scenarios described in "Adding L1-L4 design to RFC-0050", rfcs#93):
      • Scenario 1: The label arrives before the downstream workflow is triggered. We record this in Redis using mark_check_run_wanted and let the callback lambda create the check run.
      • Scenario 2: The label arrives while the downstream workflow is running. The callback lambda has already cached the workflow information, so the webhook can create the in_progress check run immediately.
      • Scenario 3: The label arrives after the downstream workflow is done. If the cached workflow information is still alive in Redis (3 days by default, configurable via OOT_STATUS_TTL), the webhook creates a completed check run immediately. Otherwise, it does not create a check run.
    • The dispatch function checks for an L3 label or L4, and stores this information in Redis so the callback can tell whether a check run is needed.
    • Create check run and check suite handling functions for the downstream workflow re-run mechanism within the check run.
  • callback function:
    • Handle upstream check-run creation for L3/L4
      • Scenario 1 & L4: Check in Redis by calling is_check_run_wanted to see if this PR needs a check run. If so, immediately create one.
      • All scenarios: Store workflow information in Redis so the webhook can create the check run.
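
The interplay between the webhook and the callback across the three scenarios can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `FakeStore` is an in-memory stand-in for Redis, the key layout and the signatures of `mark_check_run_wanted` and `is_check_run_wanted` are assumptions, and an expired cache entry is assumed to be treated the same as "no job info yet".

```python
from typing import Optional

OOT_STATUS_TTL = 3 * 24 * 3600  # 3 days, the default named in the description


class FakeStore:
    """In-memory stand-in for Redis with per-key expiry (illustration only)."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float, now: float) -> None:
        self._data[key] = (value, now + ttl)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]


# Hypothetical signatures; the real helpers live in redis_helper.py.
def mark_check_run_wanted(store: FakeStore, pr: int, repo: str, now: float) -> None:
    store.set(f"wanted:{pr}:{repo}", "1", OOT_STATUS_TTL, now)


def is_check_run_wanted(store: FakeStore, pr: int, repo: str, now: float) -> bool:
    return store.get(f"wanted:{pr}:{repo}", now) is not None


def on_l3_label(store: FakeStore, pr: int, repo: str, now: float) -> str:
    """Webhook side: an L3 label was added to the PR."""
    status = store.get(f"job:{pr}:{repo}", now)
    if status is None:
        # Scenario 1 (or an expired entry): defer creation to the callback.
        mark_check_run_wanted(store, pr, repo, now)
        return "marked_wanted"
    # Scenario 2 (in_progress) or scenario 3 (completed): create right away.
    return f"created:{status}"


def on_callback(store: FakeStore, pr: int, repo: str, status: str, now: float) -> str:
    """Callback side: the downstream workflow reported a status."""
    # All scenarios: cache workflow information for the webhook.
    store.set(f"job:{pr}:{repo}", status, OOT_STATUS_TTL, now)
    if is_check_run_wanted(store, pr, repo, now):
        return f"created:{status}"
    return "cached_only"
```

Passing `now` explicitly keeps the sketch deterministic; a real Redis client would rely on server-side key expiry instead.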

Changes

aws/lambda/cross_repo_ci_relay/
├── tests/                         # Add more unit tests
├── allowlist.py                   # Update utils function for L3
├── redis_helper.py                # Set more keys for L3
├── gh_helper.py                   # Update utils function for L3
├── misc.py                        # Update utils function for L3
├── event_handler.py               # Create PR label/check run/check suites handling function
└── callback_handler.py            # Handle upstream check-run creation

Verification

We performed the following scenario verification on our AWS Lambda instance:

  • L3:

    • L3 labels named ciflow/crcr/{device} are added immediately after the PR is created, and the corresponding check run shows up on the PR with the name crcr/{repo}/{workflow_name}.
    • After clicking into the check run, the information shown is correct.
    • L3 labels are added while the workflow is running, which should show the in_progress check run.
    • L3 labels are added after the workflow is done, which should show the completed check run.
    • The check run is updated when the PR with L3 labels is reopened or synchronized.
  • L4:

    • Check run should be created after the PR is opened.
    • Check run is updated when the PR is reopened or synchronized.
  • Re-run

    • Clicking the Re-run button in each failed check run will trigger the corresponding downstream workflow to re-run and update the check run status to in_progress.
    • Clicking the Re-run all jobs or Re-run all failed jobs button will trigger the corresponding downstream workflows in the check suite and update the corresponding check run status to in_progress.
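
The naming conventions and re-run triggers exercised above could be captured by a few small helpers. This is a sketch under assumptions: the function names are hypothetical, `{repo}` is passed through as-is (the description does not say whether it is the full `owner/name`), and re-runs are assumed to arrive as GitHub's `rerequested` action on `check_run` and `check_suite` webhook events.

```python
from typing import Optional

LABEL_PREFIX = "ciflow/crcr/"


def device_from_label(label: str) -> Optional[str]:
    """Return the {device} part of a ciflow/crcr/{device} label, else None."""
    if not label.startswith(LABEL_PREFIX):
        return None
    device = label[len(LABEL_PREFIX):]
    return device or None  # an empty device is not a valid L3 label


def check_run_name(repo: str, workflow_name: str) -> str:
    """Build the upstream check run name crcr/{repo}/{workflow_name}."""
    return f"crcr/{repo}/{workflow_name}"


def is_rerun_request(event: str, action: str) -> bool:
    """True for a re-run requested on a single check run or a whole suite."""
    return event in {"check_run", "check_suite"} and action == "rerequested"
```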

Unit Tests

  • Unit Tests (Mock)

TODO

cc @albanD @fffrog @KarhouTam @atalman @huydhn @zxiiro @subinz1 @jewelkm89

@vercel

vercel Bot commented May 27, 2026


@can-gaa-hou is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@meta-cla meta-cla Bot added the CLA Signed label May 27, 2026
@can-gaa-hou can-gaa-hou changed the title [WIP] Implement CRCR upstream check run management for L3/L4 jobs [WIP] [CRCR] Implement CRCR upstream check run management for L3/L4 jobs Jun 4, 2026
@can-gaa-hou can-gaa-hou force-pushed the crcr-L3 branch 7 times, most recently from 8fee6c5 to 30e0f25 Compare June 10, 2026 09:27
@can-gaa-hou can-gaa-hou changed the title [WIP] [CRCR] Implement CRCR upstream check run management for L3/L4 jobs [CRCR] Implement CRCR upstream check run management for L3/L4 jobs Jun 10, 2026
@can-gaa-hou can-gaa-hou marked this pull request as ready for review June 10, 2026 09:30
@can-gaa-hou can-gaa-hou force-pushed the crcr-L3 branch 2 times, most recently from 8f62972 to d1cb42c Compare June 11, 2026 06:45
    logger.info(
        "l3_labeled: no job info for repo=%s; check run marked wanted for callback",
        downstream_repo,
    )

Collaborator


upstream_token is minted for config.upstream_repo (always pytorch/pytorch), but the mint happens inside the for downstream_repo in l3_repos loop. If a device maps to multiple repos, this creates a redundant installation token for the same upstream repo on every iteration.

Consider hoisting it above the loop:

    upstream_token = gh_helper.get_repo_access_token(
        config.github_app_id,
        config.github_app_private_key,
        config.upstream_repo,
    )

    created: list[str] = []
    for downstream_repo in l3_repos:
        ...

This also gives a clean fail-fast — if the token mint fails, it would fail identically on every iteration anyway, so there's no benefit to retrying it per-repo inside the loop.

Contributor Author


Good catch! Fixed by minting the token lazily: when job_info is None (which mostly happens when the label is added before the workflow starts), we don't need to call the GitHub API at all.
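
For readers following the thread, the lazy-mint idea might look roughly like this. It is only a sketch: `create_check_runs`, `get_job_info`, and `mint_token` are hypothetical stand-ins (the real code calls `gh_helper.get_repo_access_token` and the GitHub API), and the actual fix in the PR may be structured differently.

```python
from typing import Callable, Optional


def create_check_runs(
    l3_repos: list[str],
    get_job_info: Callable[[str], Optional[dict]],
    mint_token: Callable[[], str],
) -> list[str]:
    """Create check runs for repos with cached job info, minting at most one token."""
    upstream_token: Optional[str] = None
    created: list[str] = []
    for downstream_repo in l3_repos:
        job_info = get_job_info(downstream_repo)
        if job_info is None:
            # Label arrived before the workflow started: nothing to create,
            # so no token (and no GitHub API call) is needed for this repo.
            continue
        if upstream_token is None:
            # Mint once, on first use, and reuse for the remaining repos.
            upstream_token = mint_token()
        # The real handler would create the check run with upstream_token here.
        created.append(downstream_repo)
    return created
```

Compared with hoisting the mint above the loop, this avoids any token request when every repo is still waiting for its workflow to start, while still minting at most once per invocation.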

@can-gaa-hou can-gaa-hou force-pushed the crcr-L3 branch 3 times, most recently from 2a7dbf8 to 6b40dc7 Compare June 18, 2026 02:42