Skip to content

Multi server multi gpu#1367

Merged
bw4sz merged 7 commits into
weecology:mainfrom
henrykironde:multi-server-multi-gpu
Jun 3, 2026
Merged

Multi server multi gpu#1367
bw4sz merged 7 commits into
weecology:mainfrom
henrykironde:multi-server-multi-gpu

Conversation

@henrykironde

@henrykironde henrykironde commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

Description

Related Issue(s)

AI-Assisted Development

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting
  • I have reviewed and validated all AI-generated code

AI tools used (if applicable):

@henrykironde henrykironde force-pushed the multi-server-multi-gpu branch 3 times, most recently from 5f40515 to 0905ce4 Compare April 4, 2026 05:55
@codecov

codecov Bot commented Apr 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 82.50000% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.34%. Comparing base (f3bd776) to head (912ca4a).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/deepforest/distributed.py 70.83% 14 Missing ⚠️
src/deepforest/main.py 80.64% 12 Missing ⚠️
src/deepforest/predict.py 76.19% 5 Missing ⚠️
src/deepforest/scripts/evaluate.py 71.42% 2 Missing ⚠️
src/deepforest/datasets/prediction.py 96.77% 1 Missing ⚠️
src/deepforest/scripts/train.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1367      +/-   ##
==========================================
- Coverage   86.96%   85.34%   -1.62%     
==========================================
  Files          26       27       +1     
  Lines        3712     3877     +165     
==========================================
+ Hits         3228     3309      +81     
- Misses        484      568      +84     
Flag Coverage Δ
unittests 85.34% <82.50%> (-1.62%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@henrykironde henrykironde force-pushed the multi-server-multi-gpu branch from 33ef29e to b9c119c Compare April 10, 2026 04:32
@henrykironde henrykironde marked this pull request as ready for review April 10, 2026 04:44
@bw4sz

bw4sz commented Apr 10, 2026

Copy link
Copy Markdown
Collaborator

Can you confirm what SLURM script you used to check this so i can match that?

@bw4sz bw4sz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to approve this, @jveitchmichaelis any objections? I have spoken to comet and they agree that the lack of multi-node GPU utilization graph is probably on their end and not anything wrong here.

@bw4sz

bw4sz commented Apr 16, 2026

Copy link
Copy Markdown
Collaborator

remove references to torchrun, we can use srun alone.

@bw4sz

bw4sz commented Apr 16, 2026

Copy link
Copy Markdown
Collaborator

To do, is @henrykironde comparing this with #1304 and decide if both are needing, but we want to get this done because it's broad and could make rebasing harder.

@henrykironde henrykironde force-pushed the multi-server-multi-gpu branch from b9c119c to cc12a58 Compare May 17, 2026 21:32
return dist.get_rank() == 0


def should_sync(trainer: Any | None = None) -> bool:

@jveitchmichaelis jveitchmichaelis May 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required? I thoughtorchmetrics handles syncing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need this for tensor metrics. TorchMetrics artifacts already sync those across ranks. For non-tensor results (pandas DataFrames), we handle distribution explicitl. Without gathering predictions across ranks, each GPU keeps its own DataFrame, which led to duplicated or inconsistent outputs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jveitchmichaelis - can you take a look at this response and either merge if @henrykironde's response clears things up or the two of you get together to figure it out so we can get this one merged. Thanks!

@bw4sz bw4sz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, related to the question of when and how we should launch multi-gpu tests to verify future releases.

@bw4sz bw4sz merged commit e7d3aa0 into weecology:main Jun 3, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants