Multi server multi gpu by henrykironde · Pull Request #1367 · weecology/DeepForest

henrykironde · 2026-04-03T08:45:07Z

Description

Related Issue(s)

AI-Assisted Development

I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
I understand all the code I'm submitting
I have reviewed and validated all AI-generated code

AI tools used (if applicable):

codecov · 2026-04-04T06:34:05Z

Codecov Report

❌ Patch coverage is 82.50000% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.34%. Comparing base (f3bd776) to head (912ca4a).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
src/deepforest/distributed.py	70.83%	14 Missing ⚠️
src/deepforest/main.py	80.64%	12 Missing ⚠️
src/deepforest/predict.py	76.19%	5 Missing ⚠️
src/deepforest/scripts/evaluate.py	71.42%	2 Missing ⚠️
src/deepforest/datasets/prediction.py	96.77%	1 Missing ⚠️
src/deepforest/scripts/train.py	90.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1367      +/-   ##
==========================================
- Coverage   86.96%   85.34%   -1.62%     
==========================================
  Files          26       27       +1     
  Lines        3712     3877     +165     
==========================================
+ Hits         3228     3309      +81     
- Misses        484      568      +84

Flag	Coverage Δ
unittests	`85.34% <82.50%> (-1.62%)`	⬇️

bw4sz · 2026-04-10T17:05:38Z

Can you confirm which SLURM script you used to check this so I can match it?

bw4sz

I want to approve this; @jveitchmichaelis, any objections? I have spoken to Comet, and they agree that the missing multi-node GPU utilization graph is probably an issue on their end and not anything wrong here.

bw4sz · 2026-04-16T17:56:06Z

Remove references to torchrun; we can use srun alone.
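
When srun is the only launcher, each task can learn its rank from the variables Slurm sets per task (`SLURM_PROCID`, `SLURM_NTASKS`) instead of the `RANK` and `WORLD_SIZE` variables that torchrun exports. The following is a minimal sketch of that lookup, not code from this PR: `resolve_rank` and `is_rank_zero` are hypothetical names, and the fallback to a single process is an assumption made for illustration.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_rank(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Return (rank, world_size) from launcher environment variables.

    Hypothetical helper for illustration; not part of DeepForest.
    """
    env = os.environ if env is None else env
    # torchrun exports RANK/WORLD_SIZE; srun exports SLURM_PROCID/SLURM_NTASKS.
    candidates = (("RANK", "WORLD_SIZE"), ("SLURM_PROCID", "SLURM_NTASKS"))
    for rank_key, size_key in candidates:
        if rank_key in env and size_key in env:
            return int(env[rank_key]), int(env[size_key])
    # Assumption: with no launcher variables set, treat this as one process.
    return 0, 1


def is_rank_zero(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only for the process that should write shared outputs."""
    return resolve_rank(env)[0] == 0


if __name__ == "__main__":
    rank, world_size = resolve_rank()
    print(f"rank {rank} of {world_size}, writes output: {is_rank_zero()}")
```

Passing the environment as an argument keeps the lookup testable without launching a job; under a plain `srun` launch only the Slurm variables would be present.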

bw4sz · 2026-04-16T18:17:06Z

To do: @henrykironde will compare this with #1304 and decide whether both are needed. We still want to get this one merged first, because it is broad and leaving it open could make rebasing harder.

jveitchmichaelis · 2026-05-17T21:49:27Z

+    return dist.get_rank() == 0
+
+
+def should_sync(trainer: Any | None = None) -> bool:


Is this required? I thought torchmetrics handles syncing?

henrykironde · 2026-05-18

We do not need this for tensor metrics; TorchMetrics already syncs those across ranks. For non-tensor results (pandas DataFrames), we handle distribution explicitly. Without gathering predictions across ranks, each GPU keeps its own DataFrame, which led to duplicated or inconsistent outputs.
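
The gathering step described here can be illustrated without a GPU. The sketch below is an assumption-laden stand-in, not the PR's implementation: `merge_rank_predictions` is a hypothetical name, the per-rank results are plain lists of row dicts rather than pandas DataFrames, and the lists are built by hand where a real run would obtain them from a collective such as `torch.distributed.all_gather_object`. It drops exact duplicate rows because a distributed sampler that pads the dataset to an even size per rank can hand the same sample to two ranks.

```python
from typing import Any

Row = dict[str, Any]


def merge_rank_predictions(per_rank_rows: list[list[Row]]) -> list[Row]:
    """Concatenate prediction rows from all ranks, dropping exact duplicates.

    Hypothetical helper for illustration; rows keep first-seen order.
    """
    seen: set[tuple] = set()
    merged: list[Row] = []
    for rows in per_rank_rows:  # iterate in rank order
        for row in rows:
            # Keys are unique, so sorting items never compares values.
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


# Simulated output of two ranks; rank 1 repeats a.tif because of padding.
rank0 = [
    {"image_path": "a.tif", "xmin": 0, "ymin": 0, "xmax": 10, "ymax": 10, "score": 0.9},
    {"image_path": "b.tif", "xmin": 5, "ymin": 5, "xmax": 20, "ymax": 20, "score": 0.8},
]
rank1 = [
    {"image_path": "c.tif", "xmin": 1, "ymin": 2, "xmax": 11, "ymax": 12, "score": 0.7},
    {"image_path": "a.tif", "xmin": 0, "ymin": 0, "xmax": 10, "ymax": 10, "score": 0.9},
]

merged = merge_rank_predictions([rank0, rank1])
print([row["image_path"] for row in merged])
```

If each rank instead wrote its own list to disk, the a.tif row would appear twice across outputs, which is the kind of duplication the reply refers to.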

ethanwhite · 2026-05-21

@jveitchmichaelis, can you take a look at this response? Either merge if @henrykironde's reply clears things up, or the two of you can get together to figure it out so we can get this one merged. Thanks!

bw4sz

This looks good. A related question is when and how we should launch multi-GPU tests to verify future releases.

henrykironde force-pushed the multi-server-multi-gpu branch 3 times, most recently from 5f40515 to 0905ce4 Compare April 4, 2026 05:55

henrykironde force-pushed the multi-server-multi-gpu branch from 33ef29e to b9c119c Compare April 10, 2026 04:32

henrykironde marked this pull request as ready for review April 10, 2026 04:44

bw4sz reviewed Apr 15, 2026

bw4sz added the High Priority label Apr 16, 2026

henrykironde added 6 commits May 17, 2026 00:57

Add Lightning-based distributed runtime support for multi-node train,…

1b123ea

… evaluate, and predict

Add Hipergator smoke and prediction tests.

6957351

Adds reproducible Slurm helpers for multinode and large-tile prediction workflows.

Add concise distributed run docs.

2e9b835

Docs for multi-GPU and multi-node workflows.

fix rebase artifacts in trainer config and metrics

50145a4

Generalize Hipergator-specific term to “cluster”

c77e82f

Use srun instead of torchrun for cluster distributed jobs

cc12a58

henrykironde force-pushed the multi-server-multi-gpu branch from b9c119c to cc12a58 Compare May 17, 2026 21:32

jveitchmichaelis reviewed May 17, 2026

Move cluster scripts to src/deepforest/scripts/HPC/

912ca4a

ethanwhite assigned jveitchmichaelis and henrykironde May 21, 2026

bw4sz approved these changes Jun 3, 2026

bw4sz merged commit e7d3aa0 into weecology:main Jun 3, 2026
8 checks passed

bw4sz merged 7 commits into weecology:main from henrykironde:multi-server-multi-gpu

henrykironde May 18, 2026

ethanwhite May 21, 2026

4 participants