feat: add support for GIT LFS checkout#3909
Conversation
f9e4186 to
255aa11
Compare
|
@lizrabuya could you ask for a review in #ask-agents or in our project channel tagging Josh/Ming once you address Dahtey's comments? |
…y are not runnable on windows
242cb29 to
9a649dd
Compare
|
going to tag and throw @buildsworth-bk in here for a review |
There was a problem hiding this comment.
Adds opt-in Git LFS support to the default checkout: a --git-lfs-enabled flag, GIT_LFS_SKIP_SMUDGE=1 so checkout writes pointer files rather than downloading inline, and an explicit git lfs fetch / git lfs checkout after checkout with its own bounded retrier. The double-retry concern from the earlier round looks resolved — exhausting the inner retrier returns WasRetried: true, which breaks the outer checkout retrier, so a persistent LFS failure fails once rather than being re-attempted ~30 times.
Two things to confirm and one small cleanup, all inline. The one I'd most want an answer on is the env var name: the SUP-6529 issue specifies BUILDKITE_GIT_LFS, but this reads BUILDKITE_GIT_LFS_ENABLED — if those don't match what the backend emits, the feature silently never activates.
The PR has no risk: label, so I'm leaving this as a comment rather than an approval.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 518, then answer my questions about the findings.
| // Fail fast before any git work if git-lfs is required but missing. | ||
| if e.GitLFSEnabled { | ||
| if _, err := exec.LookPath("git-lfs"); err != nil { | ||
| return fmt.Errorf("BUILDKITE_GIT_LFS_ENABLED=true but git-lfs binary is not found on PATH: %w", err) | ||
| } | ||
| } |
There was a problem hiding this comment.
Non-blocking: Two things about this fail-fast check. First, the comment says "before any git work", but the mirror update and clone earlier in defaultCheckoutPhase have already run by the time we reach here. Second — and more impactful — the returned error is a plain fmt.Errorf, not a *gitError, so the outer checkout retrier in checkout() doesn't recognise it: errors.As(err, &errGit) fails and it falls through to the default: branch, which removes the checkout dir and retries. A missing git-lfs binary is deterministic, so this re-clones the repo on every attempt (up to ~6×) before finally failing — the opposite of fail-fast. Moving the check above the mirror/clone block and returning something that breaks the outer retrier (e.g. a *gitError{..., WasRetried: true}, the way gitLFSFetchCheckout does on exhaustion) would let it fail once, immediately.
There was a problem hiding this comment.
The interesting part here is the retry interaction. gitLFSFetchCheckout runs its own 5-attempt budget and tags an exhausted failure with WasRetried + gitErrorLFS, so the outer checkout retrier breaks (and gitErrorLFS falls through to return err without wiping the checkout dir) rather than multiplying this into ~30 attempts. I traced that path and it holds up. The fail-fast git-lfs lookup and the split fetch/checkout with distinct error prefixes are sensible.
Three non-blocking notes inline: a redundant GIT_LFS_SKIP_SMUDGE set, an env-var naming question worth confirming against the backend, and an error-path test that pays the full retry backoff. The PR also has no risk: label, so it falls outside AI-only approval.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 519, then answer my questions about the findings.
|
@buildsworth-bk review please |
|
On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/751 |
There was a problem hiding this comment.
Previous Buildsworth feedback looks resolved: the env-var naming question has a backend-confirmation reply, the duplicate GIT_LFS_SKIP_SMUDGE set was removed, and the retry-test runtime note has an explicit author decision. This pass found two checkout-path issues that I think need fixing before merge: the fail-fast binary check uses a different environment than the checkout commands, and LFS still performs an unfiltered fetch for sparse checkouts.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 751, then answer my questions about the findings.
| // Fail fast before any git work if git-lfs is required but missing. | ||
| // This operation only handles default checkout behavior, so it's possible for a custom checkout hook to require git-lfs but not have this check. That's a bit unfortunate, but we can add it to custom hooks later if needed. | ||
| if e.GitLFSEnabled { | ||
| if _, err := exec.LookPath("git-lfs"); err != nil { |
There was a problem hiding this comment.
Blocking: exec.LookPath reads the agent process PATH, but checkout commands are resolved from e.shell.Env via shell.Command / AbsolutePath, and hooks/plugins can mutate that environment before checkout. That can fail a job where git-lfs was added to the job PATH, or pass this check even though the shell PATH used for git lfs ... cannot resolve it. Can this check use the shell environment too (for example e.shell.AbsolutePath("git-lfs")) and cover that with a test that changes sh.Env rather than only t.Setenv?
| } | ||
| } | ||
|
|
||
| if e.GitLFSEnabled { |
There was a problem hiding this comment.
Blocking: sparseCheckoutActive is ignored here, so BUILDKITE_GIT_LFS_ENABLED=true with sparse checkout still runs an unfiltered git lfs fetch. That is the combination SUP-6529 calls out to avoid: sparse checkout users can still download LFS objects outside the sparse set, defeating the sparse checkout and potentially blowing up disk/network usage. Can this branch be gated on !sparseCheckoutActive, or otherwise constrained to the sparse paths, before merging?
| if err != nil && args.Retry { | ||
| return &gitError{error: err, Type: gitErrorLFS, WasRetried: true} | ||
| } | ||
| return err |
There was a problem hiding this comment.
We don't seem to be returning an error if retrying is not set to true here? WasRetried should return the value of args.Retry so even if retries are disabled, the error's context is not lost
Description
Add configuration and environment variables when the backend sets
BUILDKITE_GIT_LFS_ENABLED=truein the job environment. WhenBUILDKITE_GIT_LFS_ENABLED=trueis set, the agent will check if thegit-lfsbinary is installed and proceed to perform agit lfs fetch+git lfs checkout.Changes
Configuration & CLI (clicommand/bootstrap.go, internal/job/config.go)
Checkout behaviour (internal/job/checkout.go)
LFS fetch helper (internal/job/git.go)
Smudge bypass (internal/job/executor.go)
Tests (internal/job/checkout_test.go)
Windows test caveats
Two of the LFS error-path subtests (
LFS enabled git lfs command failsandLFS enabled git lfs fetch fails) are skipped on Windows. They cannot run there for a fundamental reason, not flakiness:PATHto point at a fakegit-lfsscript that exits non-zero, so we can assert our error-wrapping behaviour.git-lfs.exeinsideGIT_EXEC_PATH(the Git installation'slibexec/git-coredirectory). Whengitresolves a subcommand likegit lfs ..., it looks inGIT_EXEC_PATHbefore falling back toPATH.git-lfs.exeruns, succeeds, and the test fails because we expected our fake to return an error.gitLFSFetchCheckoutandgit lfs install --localfailure handling) is platform-agnostic, so the Linux runs give us the real coverage. Skipping on Windows is a test-harness limitation, not a behavioural gap.Additionally, the two happy-path subtests (
LFS disabled,LFS enabled binary present) useos.MkdirTemp+ a best-effortRemoveAllint.Cleanuprather thant.TempDir(). On Windows, git's child processes (credential helpers, thegit-lfsfilter-process) can keep file handles open past their parent's exit.t.TempDir()'s cleanup is strict and fails the test onunlinkat ... being used by another processerrors; the best-effort variant lets the test pass while still cleaning up what it can.Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Disclosures / Credits
The main function changes were implemented by me based on the recommendation from the research done for this function. I used Claude to review the code and to assist with writing the tests.