
Batch SCCs for parallel processing #21287

Open
ilevkivskyi wants to merge 2 commits into python:master from ilevkivskyi:batch-sccs-parallel

Conversation

@ilevkivskyi
Member

ilevkivskyi commented Apr 22, 2026

This is a follow-up to #21119.

The implementation is mostly straightforward. Some comments:

  • After experimenting with self-check and torch, it looks like 1/N for Mac and 1/2N for Linux should be safe batch size limits. Because of Zipf's law combined with the "rounding down" logic, the average batch size is significantly smaller than the limit: for example, with six workers on Linux, a single worker rarely takes more than 5% of the queue.
  • I add padding to the module size_hint. Apparently there are many empty __init__.py files, but processing an empty file still takes a non-trivial amount of time.
  • I use the size of the serialized tree as a better proxy for module complexity. This is probably not very important, but it avoids weird cases such as a file with lots of comments, and it also accounts for conditional function body stripping. Btw, we should probably serialize all docstrings in ast_serialize as "<docstring>" or similar (we can't skip them completely).
  • I write the cache for interfaces after each SCC in a batch, while the cache for all implementations is written in one go. There is no deep logic behind this; it is simply the easiest split given how the code is currently structured. If needed for performance reasons, this can be tweaked one way or the other (i.e. fewer, larger cache commits or more, smaller ones) without too much effort.
  • While working on this I happened to notice that our crash detector always reports a worker as "still running", so I added a short wait in case of a crash.
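
The batch size limit in the first bullet could be sketched roughly as follows. This is a hypothetical illustration, not mypy's actual code: `batch_sccs`, the `(scc, size)` tuples, and the `linux` flag are made-up names, and the "rounding down" here means a batch is closed as soon as adding the next SCC would exceed the limit.

```python
def batch_sccs(sccs, num_workers, linux=True):
    """Greedily group SCCs into batches capped at 1/N (Mac) or 1/(2N)
    (Linux) of the total work, measured by size hints.

    sccs: list of (scc, size_hint) pairs in processing order.
    """
    total = sum(size for _, size in sccs)
    limit = total // (2 * num_workers) if linux else total // num_workers
    batches, current, current_size = [], [], 0
    for scc, size in sccs:
        # "Rounding down": close the batch before it would exceed the limit,
        # so the average batch ends up well below the cap.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(scc)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With skewed (Zipf-like) SCC sizes, most batches close early, which is why the average batch is much smaller than the nominal limit.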
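
The size-hint idea from the second and third bullets (serialized-tree length as a complexity proxy, plus flat padding so empty files are not treated as free) can be illustrated like this; `PADDING`, the constant's value, and `module_size_hint` are assumptions for illustration, not mypy's actual names.

```python
PADDING = 100  # flat cost charged to every module, so empty __init__.py
               # files still count as non-trivial work

def module_size_hint(serialized_tree: bytes) -> int:
    """Estimate module complexity from the serialized tree size.

    Using the serialized tree (rather than source length) ignores
    comments and reflects conditional function body stripping.
    """
    return len(serialized_tree) + PADDING
```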

cc @JukkaL


@JukkaL
Collaborator

JukkaL commented Apr 22, 2026

I'll try this with our large internal repo to see how this impacts a repo where there is a lot of room for parallelism (on top of my WIP parallel checking improvements).

@ilevkivskyi
Member Author

Tried this with self-check on Mac, and it looks like there is a small improvement, ~3% compared to the parent commit, although the results are noisy. It looks like there is a merge conflict; going to resolve that now.

@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@JukkaL
Collaborator

JukkaL commented Apr 22, 2026

Based on one measurement, on macOS this was ~3% faster on a huge internal repository (when including recent parsing improvements and sqlite sharding). I'll try a few different tuning parameters to see if they make any difference.
