Add pytest sharding core and timing plugin [CI 7/9] #1604

Open
merkelmarrow wants to merge 1 commit into Xilinx:dev from merkelmarrow:7-ci-sharding-core-pr

Conversation

@merkelmarrow

Contributor

This is PR 7 of 9 of a series intended to make CI faster and more robust.

This PR adds a new ci/ subdirectory and a finn_ci Python package inside it. This package is designed to let any marker stage be easily split into time-balanced shards. Note that PRs 8 and 9 wire the Jenkinsfiles into this package, so some of the package's surface won't be consumed until then (to keep this PR reviewable). This PR is safe to merge on its own because it does not affect the existing Jenkins pipeline and test changes do not change actual behaviour (only the way they are generated).

Currently, FINN's Jenkins pipeline is a handful of long marker stages, and the best way to improve wall clock time is to split those stages between more workers (i.e. sharding). Doing that reliably requires:

A) a definition of the board and stage matrix that both the tests and the build pipeline agree on
B) a way to assign tests to shards that is deterministic across every worker (so xdist collections match)
C) a way for per-test timing data to feed the next run so that shards are balanced

A - Consolidated config

The finn_ci package now owns the CI board and stage tables and anything derived from them, eliminating duplication between pytest and the Jenkinsfiles. These tables are now validated directly by the unit tests.
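As a rough illustration of the single-source-of-truth idea (not the actual contents of finn_ci.config), the sketch below keeps a board table and a stage table in one module, derives the shard matrix from them, and validates them. The `Stage` fields, the board names and the helper names `validate` and `shard_matrix` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One CI stage (hypothetical layout, not the real finn_ci schema)."""
    marker: str          # pytest marker that selects the stage's tests
    boards: tuple = ()   # board names the stage runs on; () means board-independent
    shards: int = 1      # number of shards the stage is split into


# Placeholder tables; the real board and stage names live in finn_ci.config.
BOARDS = {
    "board-a": {"family": "family-x"},
    "board-b": {"family": "family-y"},
}
STAGES = (
    Stage(marker="unit", shards=2),
    Stage(marker="end2end", boards=("board-a", "board-b"), shards=3),
)


def validate(boards, stages):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    seen = set()
    for stage in stages:
        if stage.marker in seen:
            errors.append(f"duplicate marker: {stage.marker}")
        seen.add(stage.marker)
        if stage.shards < 1:
            errors.append(f"{stage.marker}: shards must be >= 1")
        for board in stage.boards:
            if board not in boards:
                errors.append(f"{stage.marker}: unknown board {board}")
    return errors


def shard_matrix(stages):
    """Derive one (marker, board, shard_index) row per job the pipeline runs."""
    rows = []
    for stage in stages:
        for board in stage.boards or (None,):
            for index in range(stage.shards):
                rows.append((stage.marker, board, index))
    return rows
```

Because both the tests and the pipeline would read the same tables, a unit test that asserts `validate(BOARDS, STAGES) == []` catches a typo in a board name before any Jenkins job is launched.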

B & C - Sharding and timing

There were several approaches available for this. For instance, the simplest is round-robin allocation, but that produces uneven shards and longer overall wall-clock times. Another approach is to record the durations of tests manually, store those durations somewhere, and have your CI consume that file. The problem with this is that it goes stale quickly as tests are added, and someone needs to refresh the timings file.

There exist off-the-shelf options, but none were exactly what the problem needed. For instance, pytest-shard can't balance by durations/weights, and pytest-split doesn't have awareness of xdist_group, so it splits checkpoint chains that need to stay together.

I ended up writing a simple pytest sharding plugin that does the following:

1. Grouping: Before assigning, tests are bucketed into groups. Tests marked with the same xdist_group form one group, because they hand files to each other and must stay together.
2. Selection: Every shard's pytest run collects the same full set of tests for that marker. The plugin consults the config tables and a master timings file, decides which tests this shard should keep, and discards the rest. Every shard performs exactly the same calculation, so the assignment is deterministic.
3. Updating: Each group takes a certain amount of time. After a shard finishes, the plugin writes a small sidecar file next to the JUnit XML recording how long each group took. A full build later merges those into a persistent timing master file, which feeds step 2.
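The grouping and selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's code: the function names `group_tests` and `assign_shards` are hypothetical, tests without an xdist_group are assumed to form singleton groups, and groups with no recorded duration are assumed to get the mean of the known durations. Packing uses longest-processing-time-first, with round-robin when no timings exist at all.

```python
import heapq
from collections import defaultdict


def group_tests(tests):
    """Bucket (test_id, xdist_group_or_None) pairs into named groups.

    Tests sharing an xdist_group land in one group; a test without a
    group becomes a singleton group keyed by its own id (assumption).
    """
    groups = defaultdict(list)
    for test_id, xdist_group in tests:
        groups[xdist_group or test_id].append(test_id)
    return dict(groups)


def assign_shards(group_names, weights, num_shards):
    """Return {group_name: shard_index}, identical on every worker.

    The result depends only on the arguments, and ties are broken by
    group name and shard index, so each shard can compute it on its own.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    names = sorted(group_names)
    known = [weights[name] for name in names if name in weights]
    if not known:
        # Cold start: no timing data yet, fall back to round-robin.
        return {name: i % num_shards for i, name in enumerate(names)}
    default = sum(known) / len(known)  # weight for groups never timed before

    # Longest-processing-time-first: heaviest group goes to the lightest shard.
    ordered = sorted(names, key=lambda n: (-weights.get(n, default), n))
    heap = [(0.0, shard) for shard in range(num_shards)]
    assignment = {}
    for name in ordered:
        load, shard = heapq.heappop(heap)
        assignment[name] = shard
        heapq.heappush(heap, (load + weights.get(name, default), shard))
    return assignment
```

A shard with index `k` would then keep only the tests whose group maps to `k` and deselect the rest. Sorting the names before packing matters: collection order can differ between workers, and any order-dependent tie-break would make the shards disagree.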

This is a self-healing system which rebalances as new tests are added, keeps grouped tests together, and falls back to simple round-robin on a cold start. The pytest plugin does nothing if the sharding flag isn't passed, so regular contributors won't experience any changes to local testing.
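The feedback loop that makes this self-healing can be illustrated with a small merge step. The sketch below assumes each sidecar and the master file are flat JSON objects mapping group name to seconds, and that the newest measurement simply replaces the old one; the real file layout and merge policy are not described here, and `load_json` and `merge_timings` are hypothetical names.

```python
import json
import os
import tempfile
from pathlib import Path


def load_json(path):
    """Read a JSON object from path; return {} if missing or malformed."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def merge_timings(master_path, sidecar_paths):
    """Fold per-shard timing sidecars into the master file and return it.

    Assumed policy: the latest measurement wins, and groups absent from
    this run keep their previous value so their history is not lost.
    """
    master_path = Path(master_path)
    master = load_json(master_path)
    for path in sorted(sidecar_paths):
        for group, seconds in load_json(path).items():
            is_number = isinstance(seconds, (int, float)) and not isinstance(seconds, bool)
            if is_number and seconds >= 0:
                master[group] = float(seconds)

    # Write to a temporary file and rename, so a crash never leaves a
    # half-written master behind for the next build to read.
    fd, tmp_name = tempfile.mkstemp(dir=str(master_path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(master, handle, indent=2, sort_keys=True)
    os.replace(tmp_name, master_path)
    return master
```

With a policy like this, a newly added test has no entry on its first run, gets one from that run's sidecar, and is balanced by its measured duration from the next build onward.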

Other changes

  • test_end2end_bnn_pynq.py generated its test matrix in a fragile way, duplicating the board list in basic.py, and the per-board markers were a third copy. In addition, because a group was named by its position i in the generated list, inserting or removing a scenario would renumber every group after it (which would invalidate the new timing strategy). The board metadata now lives in finn_ci.config.BOARDS, and the test generates its matrix from that table.
  • print_pytest_failures.py is a new, simple script that parses already-produced JUnit XML and prints a tail of test failures into the CI log, for observability. It never crashes a run.
  • Use @pytest.mark.shard(N) to pin a test to a particular shard.
  • Unit tests for all the above
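The renumbering problem described in the first bullet can be shown with a toy matrix. The scenario tuples, the board name and both naming helpers below are hypothetical; the point is only that a name derived from a scenario's values survives edits to the list, while a name derived from its position does not.

```python
def positional_groups(scenarios):
    """Old style: name each group after its index in the generated list."""
    return {scenario: f"group_{i}" for i, scenario in enumerate(scenarios)}


def value_groups(scenarios):
    """New style: name each group after the scenario's own values."""
    return {scenario: "bnn_" + "_".join(str(v) for v in scenario)
            for scenario in scenarios}


# (board, topology, weight bits, activation bits); placeholder values.
before = [("board-a", "tfc", 1, 1), ("board-a", "cnv", 1, 1)]
# Insert one new scenario in the middle of the matrix.
after = [("board-a", "tfc", 1, 1), ("board-a", "tfc", 2, 2), ("board-a", "cnv", 1, 1)]

cnv = ("board-a", "cnv", 1, 1)
renamed_positionally = positional_groups(before)[cnv] != positional_groups(after)[cnv]
renamed_by_value = value_groups(before)[cnv] != value_groups(after)[cnv]
print(renamed_positionally, renamed_by_value)
```

Since recorded durations are keyed by group name, a positional rename would attach the old timing to a different scenario, whereas value-derived names keep each scenario's history attached to it.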

@merkelmarrow

Contributor Author

A note on tmp_path in this PR: the new tests specifically test finn_ci, not real FINN builds, so it makes sense to me that you shouldn't need finn/FINN_BUILD_DIR to run these tests in particular (to check that your configs are correct). make_build_dir/robust_rmtree live in finn.util.basic, so using them here would couple this test suite to a full finn install, whereas finn_ci is deliberately importable without finn. Let me know if you think tmp_path isn't the right choice here.

Introduce a small finn_ci package, importable without the finn package
installed, that becomes the single source of truth for the FINN CI
matrix and backs a pytest sharding plugin:

- config: the CI board (BOARDS) and per-row stage (STAGES) tables plus
  helpers that the build pipeline derives from them
- sharding: weight-balanced group-to-shard assignment by
  longest-processing-time-first packing, degrading to round-robin
- plugin: a pytest plugin that selects a shard by marker, keeps tests
  sharing an xdist_group on the same shard, and writes per-shard
  timing and shard-map files so later builds can balance by measured
  duration.

The plugin does nothing unless a shard count is requested.

Parametrise the BNN end2end matrix off the shared board table with
stable, value-derived xdist_group names, so editing the matrix no
longer renames unrelated groups or loses their timing history. The
board list now lives in finn_ci.config (BOARDS and TEST_BOARDS) and
the old test_board_map in finn.util.basic is removed. Group the
ipstitch gen, stitch and rtlsim checkpoint chain per mem_mode so
each step's output is on disk before the next test reads it.

Add a stdlib JUnit failure printer for printing per-test failure
context in CI logs, plus unit tests covering the config tables,
shard assignment, the plugin under xdist, the JSON helper, and
the pytest failure printer.

Signed-off-by: Marco Blackwell <mblackwe@amd.com>
@merkelmarrow merkelmarrow force-pushed the 7-ci-sharding-core-pr branch from 0a34113 to b6fad24 Compare June 11, 2026 16:21