Skip to content

[RFC] diff: add diff.<driver>.process for external hunk providers#2120

Open
mmontalbo wants to merge 5 commits into
gitgitgadget:masterfrom
mmontalbo:mm/structural-diff-backend-clean
Open

[RFC] diff: add diff.<driver>.process for external hunk providers#2120
mmontalbo wants to merge 5 commits into
gitgitgadget:masterfrom
mmontalbo:mm/structural-diff-backend-clean

Conversation

@mmontalbo
Copy link
Copy Markdown

@mmontalbo mmontalbo commented May 22, 2026

This series adds diff.<driver>.process, a long-running subprocess
protocol that lets external tools provide hunks to git's diff and
blame pipelines.

Over the past 18 years, git's diff pipeline accumulated many
features that operate on hunks: word diff, function context,
color-moved, indent heuristic, blame. External tools can replace
the pipeline entirely (diff.<driver>.command) or select among
builtin algorithms (diff.<driver>.algorithm), but there is no way
for a tool to provide line-change information into the pipeline.
Tools that understand code structure (tree-sitter parsers,
format-aware analyzers, tools like Difftastic and Mergiraf) must
bypass git's pipeline and lose access to everything downstream.

The protocol follows filter.<driver>.process: pkt-line over
stdin/stdout, capability negotiation, one tool invocation per
git command. The tool receives file pairs and returns hunk
descriptors that git feeds into the standard xdiff pipeline.
All output features work normally.

Zero hunks with status=success means the tool considers the
files equivalent. git diff shows no output for the file, and
git blame skips the commit, attributing lines to earlier
commits.

On error or tool crash, git falls back silently to the builtin
diff algorithm. The feature is opt-in via diff.<driver>.process
and .gitattributes; unconfigured files are unaffected.

The series includes git diff-process-normalize, a built-in
tool that compares files line by line ignoring whitespace
(same logic as "git diff -w" via xdiff_compare_lines):

[diff "cdiff"]
    process = git diff-process-normalize

A whitespace-only boolean flag could serve this specific case.
The subprocess protocol is more general, allowing any tool to
participate without further changes to git.

@mmontalbo
Copy link
Copy Markdown
Author

/preview

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 22, 2026

Preview email sent as pull.2120.git.1779415095.gitgitgadget@gmail.com

@mmontalbo mmontalbo changed the title [RFC] diff: add diff.<driver>.process for external hunk providers [RFC] diff: add diff.<driver>.process for external hunk providers May 22, 2026
@mmontalbo
Copy link
Copy Markdown
Author

/preview

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 22, 2026

Preview email sent as pull.2120.git.1779415483.gitgitgadget@gmail.com

@mmontalbo mmontalbo force-pushed the mm/structural-diff-backend-clean branch from dfcb1c3 to 8c7359b Compare May 22, 2026 02:09
@mmontalbo
Copy link
Copy Markdown
Author

/submit

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 22, 2026

Submitted as pull.2120.git.1779415884.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2120/mmontalbo/mm/structural-diff-backend-clean-v1

To fetch this version to local tag pr-2120/mmontalbo/mm/structural-diff-backend-clean-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2120/mmontalbo/mm/structural-diff-backend-clean-v1

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 22, 2026

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series adds diff.<driver>.process, a long-running subprocess protocol
> that lets external tools provide hunks to git's diff and blame pipelines.
>
> Over the past 18 years, git's diff pipeline accumulated many features that
> operate on hunks: word diff, function context, color-moved, indent
> heuristic, blame. External tools can replace the pipeline entirely
> (diff.<driver>.command) or select among builtin algorithms
> (diff.<driver>.algorithm), but there is no way for a tool to provide
> line-change information into the pipeline. Tools that understand code
> structure (tree-sitter parsers, format-aware analyzers, tools like
> Difftastic and Mergiraf) must bypass git's pipeline and lose access to
> everything downstream.
>
> The protocol follows filter.<driver>.process: pkt-line over stdin/stdout,
> capability negotiation, one tool invocation per git command. The tool
> receives file pairs and returns hunk descriptors that git feeds into the
> standard xdiff pipeline. All output features work normally.
>
> Zero hunks with status=success means the tool considers the files
> equivalent. git diff shows no output for the file, and git blame skips the
> commit, attributing lines to earlier commits.
>
> On error or tool crash, git falls back silently to the builtin diff
> algorithm. The feature is opt-in via diff.<driver>.process and
> .gitattributes; unconfigured files are unaffected.
>
> The series includes git diff-process-normalize, a built-in tool that
> compares files line by line ignoring whitespace (same logic as "git diff -w"
> via xdiff_compare_lines):

Interesting.

If the goal is purely to normalize content before comparison
(e.g. stripping comments or canonicalizing formatting), we already
have the `textconv` mechanism.  While `textconv` is a "one-shot"
per-file process, it is significantly simpler.

I suspect, however, that the primary focus here is to allow external
tools to provide structural alignment (e.g. for AST- aware diffs
like Difftastic or Mergiraf) without losing the original content in
the display.  Unlike `textconv`, which transforms the text the user
sees, this protocol lets the display remain identical to the source
while using a custom engine for the line-matching logic.

If that is the intent, it should be stated more explicitly in the
documentation and commit messages.  The "whitespace-normalize"
demonstration in [PATCH 5/5] is misleading because it's exactly the
case where `textconv` would be sufficient.

I am afraid that the use of a long-running subprocess for every
diff/blame invocation adds significant complexity and overhead.  In
particular, wouldn't the `blame` implementation performs a
round-trip to the subprocess for every commit in the history?  Even
with a persistent process, the overhead of serializing and
deserializing the entire file content twice (old and new) for every
commit could be prohibitive for large files or deep histories.

So, I dunno.

Comment thread xdiff-interface.c
@@ -124,7 +124,12 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co
if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +/*
> + * Populate the changed[] arrays from externally supplied hunks,
> + * bypassing the diff algorithm.  Validates that hunks are in order,
> + * non-overlapping, and within bounds.
> + *
> + * Returns 0 on success, -1 on validation failure.
> + */
> +static int xdl_populate_hunks_from_external(xdfenv_t *xe,
> +					    const struct xdl_hunk *hunks,
> +					    size_t nr_hunks)
> +{
> +	size_t i;
> +	long j, prev_old_end = 0, prev_new_end = 0;
> +	long total_old = 0, total_new = 0;
> +
> +	/*
> +	 * Clear changed[] arrays.  xdl_prepare_env() may have dirtied
> +	 * them via xdl_cleanup_records().  The allocation is nrec + 2
> +	 * elements; changed points one past the start (see xprepare.c).
> +	 */
> +	memset(xe->xdf1.changed - 1, 0,
> +	       (xe->xdf1.nrec + 2) * sizeof(bool));
> +	memset(xe->xdf2.changed - 1, 0,
> +	       (xe->xdf2.nrec + 2) * sizeof(bool));

This, especially the starting offset of -1, looks horrible.  The
internal layout of xdfenv_t might happen to match the way the above
code expects, which is how xdl_prepare_ctx() may have give you, but
it somehow feels brittle.  I guess the assumption that changed[]
does not point at the beginning of the allocated area (e.g., it is a
no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that
it cannot be helped.  Sigh.

>  int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
>  	     xdemitconf_t const *xecfg, xdemitcb_t *ecb) {
>  	xdchange_t *xscr;
>  	xdfenv_t xe;
>  	emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff;
>  
> -	if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) {
> -
> -		return -1;
> +	if (xpp->external_hunks) {
> +		if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0)
> +			return -1;
> +		if (xdl_populate_hunks_from_external(&xe,
> +						     xpp->external_hunks,
> +						     xpp->external_hunks_nr) < 0) {
> +			/*
> +			 * Invalid external hunks; fall back to the
> +			 * builtin diff algorithm.  Re-runs
> +			 * xdl_prepare_env() via xdl_do_diff().
> +			 */
> +			xdl_free_env(&xe);
> +			if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> +				return -1;

If the external tool keeps sending bogus hunks, silently falling
back to what we would have done if there weren't any external stuff
may be necessary to pleasantly keep using Git, but two and a half
short comments here.

 (1) "What we would have done" is exactly the same as what appears
     in the corresponding "else" block.  Can we make sure that we do
     not have to keep updating both copies in the future with some
     code rearrangement?

 (2) The writer of the external tool may want to see some trace of
     warning under certain flags when a failure of the tool forces
     the receiving end to fallback.

 (3) If the tool throws too many broken replies, perhaps we want to
     disable it automatically?

> +		}
> +	} else {
> +		if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> +			return -1;
>  	}
> +
>  	if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 ||
>  	    xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 ||
>  	    xdl_build_script(&xe, &xscr) < 0) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Thu, May 21, 2026 at 10:29 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +/*
> > + * Populate the changed[] arrays from externally supplied hunks,
> > + * bypassing the diff algorithm.  Validates that hunks are in order,
> > + * non-overlapping, and within bounds.
> > + *
> > + * Returns 0 on success, -1 on validation failure.
> > + */
> > +static int xdl_populate_hunks_from_external(xdfenv_t *xe,
> > +                                         const struct xdl_hunk *hunks,
> > +                                         size_t nr_hunks)
> > +{
> > +     size_t i;
> > +     long j, prev_old_end = 0, prev_new_end = 0;
> > +     long total_old = 0, total_new = 0;
> > +
> > +     /*
> > +      * Clear changed[] arrays.  xdl_prepare_env() may have dirtied
> > +      * them via xdl_cleanup_records().  The allocation is nrec + 2
> > +      * elements; changed points one past the start (see xprepare.c).
> > +      */
> > +     memset(xe->xdf1.changed - 1, 0,
> > +            (xe->xdf1.nrec + 2) * sizeof(bool));
> > +     memset(xe->xdf2.changed - 1, 0,
> > +            (xe->xdf2.nrec + 2) * sizeof(bool));
>
> This, especially the starting offset of -1, looks horrible.  The
> internal layout of xdfenv_t might happen to match the way the above
> code expects, which is how xdl_prepare_ctx() may have give you, but
> it somehow feels brittle.  I guess the assumption that changed[]
> does not point at the beginning of the allocated area (e.g., it is a
> no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that
> it cannot be helped.  Sigh.
>

Agreed it is ugly. I wanted to make sure the entire changed[] including
sentinels were clear as a defensive measure for downstream callers
(xdl_change_compact). I agree this results in something that is ugly
and brittle, but in the end I thought it was superior to relying on the
fact that upstream zeroes the entire changed[] array. Maybe if the
comment was more explicit about why this is happening it would be
helpful?

    /*
     * Clear changed[] arrays including sentinels.
     * xdl_prepare_env() may have dirtied them via
     * xdl_cleanup_records(), and xdl_change_compact() reads
     * the sentinel at changed[-1] during backward scans.
     */

> >  int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> >            xdemitconf_t const *xecfg, xdemitcb_t *ecb) {
> >       xdchange_t *xscr;
> >       xdfenv_t xe;
> >       emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff;
> >
> > -     if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) {
> > -
> > -             return -1;
> > +     if (xpp->external_hunks) {
> > +             if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0)
> > +                     return -1;
> > +             if (xdl_populate_hunks_from_external(&xe,
> > +                                                  xpp->external_hunks,
> > +                                                  xpp->external_hunks_nr) < 0) {
> > +                     /*
> > +                      * Invalid external hunks; fall back to the
> > +                      * builtin diff algorithm.  Re-runs
> > +                      * xdl_prepare_env() via xdl_do_diff().
> > +                      */
> > +                     xdl_free_env(&xe);
> > +                     if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> > +                             return -1;
>
> If the external tool keeps sending bogus hunks, silently falling
> back to what we would have done if there weren't any external stuff
> may be necessary to pleasantly keep using Git, but two and a half
> short comments here.
>
>  (1) "What we would have done" is exactly the same as what appears
>      in the corresponding "else" block.  Can we make sure that we do
>      not have to keep updating both copies in the future with some
>      code rearrangement?
>

How about something like this:

  if (xpp->external_hunks) {
      if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0)
          return -1;
      if (xdl_populate_hunks_from_external(&xe,
                       xpp->external_hunks,
                       xpp->external_hunks_nr) == 0)
          goto diff_done;
      xdl_free_env(&xe);
  }

  if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
      return -1;

  diff_done:

>  (2) The writer of the external tool may want to see some trace of
>      warning under certain flags when a failure of the tool forces
>      the receiving end to fallback.
>

In diff.c how about we emit a warning rather than a trace on
fallback:

    warning(_("diff process failed for '%s',"
              " falling back to builtin diff"),
            name_a);

>  (3) If the tool throws too many broken replies, perhaps we want to
>      disable it automatically?
>

For the RFC I wanted to keep it simple, but I definitely agree. A configurable
failure policy makes a lot of sense to me (e.g., disable after N failures).

> > +             }
> > +     } else {
> > +             if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> > +                     return -1;
> >       }
> > +
> >       if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 ||
> >           xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 ||
> >           xdl_build_script(&xe, &xscr) < 0) {

@mmontalbo mmontalbo force-pushed the mm/structural-diff-backend-clean branch 3 times, most recently from 9d61788 to fd6978a Compare May 22, 2026 18:12
mmontalbo added 5 commits May 22, 2026 11:17
Add two new xpparam_t fields (external_hunks, external_hunks_nr)
that let callers supply pre-computed hunks.  When set, xdl_diff()
populates the changed[] arrays from these hunks instead of running
the diff algorithm, then continues through compaction and emission
as usual.

Validate supplied hunks before use: reject out-of-bounds line
numbers, overlapping or out-of-order hunks, negative counts, and
violations of the synchronization invariant (unchanged line counts
must match between files).  On validation failure, fall back to
the builtin diff algorithm.

Skip trim_common_tail() in xdi_diff() when external hunks are
present, since external hunks reference line numbers in the
original content.

Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Add a new per-driver configuration key that specifies the command
for a long-running diff process.

The name follows filter.<driver>.process: a long-running subprocess
that stays alive across files within a single git invocation.

Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Add support for external diff processes that communicate via the
long-running process protocol (pkt-line over stdin/stdout).

A diff process is configured per userdiff driver:

    [diff "cdiff"]
        process = /path/to/diff-tool

The tool provides custom line-matching: it receives file pairs
and returns hunks that reference original line numbers.  Unlike
textconv, which transforms the displayed content, the diff
output shows the actual file while the tool controls which
lines are marked as changed.

The handshake negotiates version=1 and capability=hunks.  Per-file
requests send command=hunks, pathname, and both file contents as
packetized data.  The tool responds with hunk lines and a status
packet.  On error, git falls back to the builtin diff algorithm
with a warning.

Zero hunks with status=success means the tool considers the
files equivalent.  Git skips diff output for that file.

Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
When a diff process is configured via diff.<driver>.process,
consult it during blame's per-commit diffing.  If the process
returns zero hunks for a commit's changes to a file, treat the
commit as having no changes, causing blame to attribute lines
to earlier commits.

The subprocess is long-running (one startup cost amortized
across the blame traversal), but each commit in the file's
history incurs a round-trip to the tool.

Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Add git diff-process-normalize as a minimal reference
implementation of the diff process protocol.  It compares
files line by line using xdiff_compare_lines() with
XDF_IGNORE_WHITESPACE (same logic as "git diff -w").  If all
lines match, it returns zero hunks; otherwise it returns an
error so git falls back to the builtin diff algorithm.

    [diff "cdiff"]
        process = git diff-process-normalize

Update documentation to describe the diff process protocol
and the built-in normalize tool.

Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
@mmontalbo mmontalbo force-pushed the mm/structural-diff-backend-clean branch from fd6978a to 9818598 Compare May 22, 2026 18:17
@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 22, 2026

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Thu, May 21, 2026 at 10:29 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This series adds diff.<driver>.process, a long-running subprocess protocol
> > that lets external tools provide hunks to git's diff and blame pipelines.
> >
> > Over the past 18 years, git's diff pipeline accumulated many features that
> > operate on hunks: word diff, function context, color-moved, indent
> > heuristic, blame. External tools can replace the pipeline entirely
> > (diff.<driver>.command) or select among builtin algorithms
> > (diff.<driver>.algorithm), but there is no way for a tool to provide
> > line-change information into the pipeline. Tools that understand code
> > structure (tree-sitter parsers, format-aware analyzers, tools like
> > Difftastic and Mergiraf) must bypass git's pipeline and lose access to
> > everything downstream.
> >
> > The protocol follows filter.<driver>.process: pkt-line over stdin/stdout,
> > capability negotiation, one tool invocation per git command. The tool
> > receives file pairs and returns hunk descriptors that git feeds into the
> > standard xdiff pipeline. All output features work normally.
> >
> > Zero hunks with status=success means the tool considers the files
> > equivalent. git diff shows no output for the file, and git blame skips the
> > commit, attributing lines to earlier commits.
> >
> > On error or tool crash, git falls back silently to the builtin diff
> > algorithm. The feature is opt-in via diff.<driver>.process and
> > .gitattributes; unconfigured files are unaffected.
> >
> > The series includes git diff-process-normalize, a built-in tool that
> > compares files line by line ignoring whitespace (same logic as "git diff -w"
> > via xdiff_compare_lines):
>
> Interesting.
>
> If the goal is purely to normalize content before comparison
> (e.g. stripping comments or canonicalizing formatting), we already
> have the `textconv` mechanism.  While `textconv` is a "one-shot"
> per-file process, it is significantly simpler.
>
> I suspect, however, that the primary focus here is to allow external
> tools to provide structural alignment (e.g. for AST- aware diffs
> like Difftastic or Mergiraf) without losing the original content in
> the display.  Unlike `textconv`, which transforms the text the user
> sees, this protocol lets the display remain identical to the source
> while using a custom engine for the line-matching logic.
>
> If that is the intent, it should be stated more explicitly in the
> documentation and commit messages.  The "whitespace-normalize"
> demonstration in [PATCH 5/5] is misleading because it's exactly the
> case where `textconv` would be sufficient.
>

Thank you for looking at this.

Yes, you have correctly identified the primary focus. My intention with the
whitespace normalization example was to provide a kind of "hello world"
diff process that would showcase how such a tool could interact with
the pipeline further down (i.e., blame vs diff output). However, I do agree
that it is a confusing example because it seems to clash with something
`textconv` already provides. I will update messaging across the series to
make the true intention of these changes more clear.

> I am afraid that the use of a long-running subprocess for every
> diff/blame invocation adds significant complexity and overhead.  In
> particular, wouldn't the `blame` implementation performs a
> round-trip to the subprocess for every commit in the history?  Even
> with a persistent process, the overhead of serializing and
> deserializing the entire file content twice (old and new) for every
> commit could be prohibitive for large files or deep histories.
>
> So, I dunno.

I hear you on this point. Anecdotally, my measurement of running blame
on diff.c:

Performance:
  without process:  0.57s
  with process:       0.67s

Blame attribution:
  without process:  726 unique commits
  with process:     721 unique commits (5 whitespace-only skipped)

Skipped commits:
  0ea7d5b6f8 diff.c: fix some recent whitespace style violations
  2775d92c53 diff.c: stylefix
  4b25d091ba Fix a bunch of pointer declarations (codestyle)
  a6080a0a44 War on whitespace
  eeefa7c90e Style fixes, add a space after if/for/while.

I was imagining we could potentially optimize performance by extending the
protocol to enable passing OIDs as a capability so tools could read objects
directly without needing any serialization/deserialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant