
Batch SCCs for parallel processing #21287

Open
ilevkivskyi wants to merge 2 commits into python:master from ilevkivskyi:batch-sccs-parallel

Conversation

@ilevkivskyi
Member

ilevkivskyi commented Apr 22, 2026

This is a follow-up to #21119.

The implementation is mostly straightforward. Some comments:

  • After experimenting with self-check and torch, it looks like 1/N for Mac and 1/2N for Linux should be safe batch size limits. Because of Zipf's law combined with the "rounding down" logic, the average batch size is significantly smaller than the limit: for example, with six workers on Linux, a single worker rarely takes more than 5% of the queue.
  • I add padding to the module size_hint. Apparently there are many empty __init__.py files, but processing an empty file still takes a non-trivial amount of time.
  • I use the size of the serialized tree as a better proxy for module complexity. This is probably not very important, but it avoids weird cases such as a file with lots of comments, and it also accounts for conditional function body stripping. Btw, we should probably serialize all docstrings in ast_serialize as "<docstring>" or similar (we can't skip them completely).
  • I write the cache for interfaces after each SCC in a batch, while the cache for all implementations is written in one go. There is no deep logic behind this; it is simply the easiest split given how the code is currently structured. If needed for performance reasons, this can be tweaked one way or the other (i.e. fewer, larger cache commits or more, smaller ones) without too much effort.
  • While working on this I happened to notice that our crash detector always reports a worker as "still running", so I added a short wait in case of a crash.
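
The batch size limit in the first bullet could be sketched roughly as follows. This is a hypothetical illustration, not mypy's actual code: `batch_sccs`, the `(scc, size)` tuples, and the `linux` flag are made-up names, and the "rounding down" here means a batch is closed as soon as adding the next SCC would exceed the limit.

```python
def batch_sccs(sccs, num_workers, linux=True):
    """Greedily group SCCs into batches capped at 1/N (Mac) or 1/(2N)
    (Linux) of the total work, measured by size hints.

    sccs: list of (scc, size_hint) pairs in processing order.
    """
    total = sum(size for _, size in sccs)
    limit = total // (2 * num_workers) if linux else total // num_workers
    batches, current, current_size = [], [], 0
    for scc, size in sccs:
        # "Rounding down": close the batch before it would exceed the limit,
        # so the average batch ends up well below the cap.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(scc)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With skewed (Zipf-like) SCC sizes, most batches close early, which is why the average batch is much smaller than the nominal limit.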
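
The size-hint idea from the second and third bullets (serialized-tree length as a complexity proxy, plus flat padding so empty files are not treated as free) can be illustrated like this; `PADDING`, the constant's value, and `module_size_hint` are assumptions for illustration, not mypy's actual names.

```python
PADDING = 100  # flat cost charged to every module, so empty __init__.py
               # files still count as non-trivial work

def module_size_hint(serialized_tree: bytes) -> int:
    """Estimate module complexity from the serialized tree size.

    Using the serialized tree (rather than source length) ignores
    comments and reflects conditional function body stripping.
    """
    return len(serialized_tree) + PADDING
```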

cc @JukkaL


@JukkaL
Collaborator

JukkaL commented Apr 22, 2026

I'll try this with our large internal repo to see how this impacts a repo where there is a lot of room for parallelism (on top of my WIP parallel checking improvements).

@ilevkivskyi
Member Author

Tried this with self-check on Mac, and it looks like there is a small improvement, ~3% compared to the parent commit, although the results are noisy. It looks like there is a merge conflict; going to resolve that now.

@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@JukkaL
Collaborator

JukkaL commented Apr 22, 2026

Based on one measurement, on macOS this was ~3% faster on a huge internal repository (when including recent parsing improvements and sqlite sharding). I'll try a few different tuning parameters to see if they make any difference.
