
Add parallel tuning on multiple remote GPUs using Ray #328

Open
isazi wants to merge 68 commits into master from parallel_runner


@isazi
Collaborator

@isazi isazi commented Aug 13, 2025

Working on a simple parallel runner that uses Ray to distribute the benchmarking of different configurations to remote Ray workers.
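
A minimal sketch of the idea, not the code in this PR: each worker benchmarks one configuration and the host collects the results. To keep the example runnable without a cluster, `ThreadPoolExecutor` from the standard library stands in for the remote Ray workers. `benchmark_config`, `tune_parallel`, and the fake timing formula are hypothetical names made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def benchmark_config(config):
    """Hypothetical stand-in for compiling and timing one kernel configuration."""
    # Fake, deterministic "runtime" so the sketch needs no GPU.
    return {**config, "time": 1000.0 / (config["block_size_x"] * config["tile"])}


def tune_parallel(tune_params, num_workers=4):
    """Benchmark every configuration on a worker pool; results keep input order."""
    keys = list(tune_params)
    configs = [dict(zip(keys, values)) for values in product(*tune_params.values())]
    # Assumption: a thread pool plays the role of the remote Ray workers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(benchmark_config, configs))


results = tune_parallel({"block_size_x": [32, 64, 128], "tile": [1, 2]})
best = min(results, key=lambda r: r["time"])
```

In the actual runner the pool would be replaced by Ray tasks or actors pinned to GPUs; the host-side pattern of "submit all configurations, gather results" stays the same.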

@isazi isazi self-assigned this Aug 13, 2025
@isazi isazi marked this pull request as draft August 13, 2025 09:39

@stijnh stijnh changed the title Simple parallel runner Add parallel tuning on multiple remote GPUs using Ray Jan 19, 2026
@stijnh stijnh self-assigned this Jan 19, 2026
@stijnh
Member

stijnh commented Jan 20, 2026

The current parallel runner works. I've been able to run on multiple GPUs on DAS6-VU and DAS6-Leiden.

There are several remaining problems:

  • The timings are incorrect, as the host assumes that the total time is just the sum over the individual configurations
  • The use of tuning_options needs to be refactored, as the entire object is currently sent to every node for each benchmark job
  • Logging information can be improved
  • The strategies are not parallel-aware yet (except brute-force)
  • A guide needs to be added to the docs explaining how to launch a Ray cluster on DAS6
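
To illustrate the first point, here is a small sketch (not taken from this PR) of why the sum over configurations overestimates the elapsed time once jobs run concurrently. It simulates handing each job to whichever worker becomes free first. The durations and both function names are hypothetical.

```python
import heapq


def sequential_total(durations):
    """Total time as a sequential host would report it: the plain sum."""
    return sum(durations)


def parallel_wall_time(durations, num_workers):
    """Simulated wall-clock time when each job goes to the first idle worker."""
    # Each heap entry is the time at which one worker becomes idle.
    workers = [0.0] * num_workers
    heapq.heapify(workers)
    for duration in durations:
        start = heapq.heappop(workers)  # earliest available worker
        heapq.heappush(workers, start + duration)
    return max(workers)


durations = [4.0, 3.0, 2.0, 1.0, 2.0]  # hypothetical per-configuration seconds
summed = sequential_total(durations)
elapsed = parallel_wall_time(durations, num_workers=2)
```

With these made-up durations and two workers, the sum is 12 s while the simulated wall-clock time is 7 s, so any bookkeeping that equates the two will be off.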

@benvanwerkhoven
Collaborator

benvanwerkhoven commented Feb 16, 2026

Sometimes with Python you run into an error and think: 'How on Earth has this error not surfaced years ago?'

It seems that observers has been None by default, rather than an empty list, since forever, and miraculously this was never an issue. The code that replaces a None value with an empty list is currently hidden in (and duplicated across) the backends. Because there is now code that (rightly) assumes observers is a list just before the backends are created, this has suddenly become a problem. The real issue, of course, is that the backends have somehow become responsible for sanitizing user input, which is not what a backend should do.
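
A minimal sketch of what sanitizing once, before any backend is created, could look like. The names `normalize_observers` and `tune` are hypothetical and do not reflect the project's actual function signatures.

```python
def normalize_observers(observers=None):
    """Return observers as a list, treating None as 'no observers'."""
    if observers is None:
        return []
    return list(observers)


def tune(kernel_name, observers=None):
    """Hypothetical entry point: sanitize user input once, up front."""
    observers = normalize_observers(observers)
    # From here on, every consumer (runner, backend) can rely on a list.
    return {"kernel": kernel_name, "num_observers": len(observers)}
```

Using None as the default and converting it at the entry point also avoids the mutable-default-argument pitfall that a literal `observers=[]` default would introduce.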


@stijnh stijnh force-pushed the parallel_runner branch 3 times, most recently from ee74137 to a769aea Compare April 7, 2026 15:44
@stijnh stijnh force-pushed the parallel_runner branch from a769aea to c6345b5 Compare April 7, 2026 15:46

sonarqubecloud Bot commented Apr 7, 2026

@stijnh
Member

stijnh commented Apr 20, 2026

I have tested this on Snellius and it appears to work fine with multiple GPUs across multiple nodes.

Remaining issues:

  • overhead_time is sometimes negative; this needs investigation
  • Fix the issues flagged by SonarQube

@benvanwerkhoven benvanwerkhoven marked this pull request as ready for review May 28, 2026 13:25

