@@ -1,13 +1,43 @@
# 📅 Jun 2, 2026
# Meeting Notes - Jun 2, 2026

📽️ [Recording](https://www.youtube.com/watch?v=WPZKYE3S_TM)
🤖 [AI Summary](https://zoom-lfx.platform.linuxfoundation.org/meeting/99965231171-1780412400000/summaries?password=25080193-f523-4c6f-b179-5c179db93c76)

## 👥 Attendees
_(see Google Doc for attendees list)_

## 📝 Quick Recap

The meeting focused on discussing two approaches for running Slurm multi-cluster in Kubernetes: using Kueue with Slurm Bridge and using Armada with Slurm. Kevin demonstrated the Kueue integration, showing how it acts as an admission engine to prevent overwhelming etcd while allowing Slurm to handle actual scheduling, requiring only configuration changes rather than code modifications. Alex then demonstrated Armada's multi-cluster capabilities, showing how it can distribute jobs across multiple Slurm clusters without overwhelming the control plane, requiring about 30 lines of code changes to disable node binding. The discussion included perspectives from new participants like Shiny from the National Research Council of Canada, who explained their need to connect large-scale radio telescope resources across multiple data centers using both Slurm and Kubernetes. The conversation also touched on GPU scheduling differences between Slurm's GRES model and Kubernetes implementations, though specific comparisons remained unclear due to limited knowledge of Volcano's approach.

## ➡️ Next Steps

- Alex: Address the comments on the white paper PR and work towards publication, then coordinate with CNCF marketing to promote the published white papers.
- Filip: Continue assigning tasks and sections for the benchmarking initiative, and follow up on the KubeCon application status.
- Marlow: Reach out to the community to gather existing benchmarking solutions and engage interested parties for the benchmarking initiative.
- Marlow: Send the correct documentation to Shiny regarding Slurm Bridge setup.
- Alex: Prepare for the focused session on the benchmarking initiative for the next meeting, inviting interested parties to attend.
- Kevin: File an issue regarding the Kueue workload object displaying the wrong scheduler name due to the mutating webhook.
- Alex: Share the setup documentation for Slurm Bridge with Shiny.

## 📋 Summary

### Team Reunion and Updates Meeting

The meeting began with casual conversation about travel, where Mesut discussed his recent trip to Finland for a Kubernetes event. Kevin joined the meeting after recovering from illness, and several other participants including Nathan, Dan, and Pavan were welcomed back to the group. The meeting was being facilitated by Alex, who mentioned that Kevin and another participant would be discussing "some slurmy stuff with Kueue and with Armada" later.

### Project Updates and Benchmarking Plans

Alex provided updates on logistical organization, including the completion of repository restructuring with indexed meeting notes and a dedicated space for initiatives like white papers and benchmarking. The white paper publication process is ongoing, with pending PR comments that need addressing before marketing outreach to CNCF. The team discussed plans for a focused benchmarking initiative session at the next meeting, with Marlow working to engage the community and identify existing benchmark resources. The conversation ended with plans to discuss Slurm multi-cluster approaches, specifically exploring options with Kueue and Slurm, as well as Armada and Slurm solutions.

### Slurm Bridge Integration with Kueue

Kevin demonstrated a Slurm Bridge integration with Kueue that allows using Slurm as a secondary scheduler while maintaining Kueue's admission control capabilities. He explained that the integration uses label selectors for opt-in namespaces, with Kueue acting as an admission engine that admits jobs and then patches them so that they are directed to Slurm. Kevin noted one technical issue: Kueue's workload object displays the wrong scheduler name because of the timing of mutating webhook operations, and he was unsure whether this could be addressed.
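
The admit-then-patch flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual Kueue or Slurm Bridge code: the `AdmissionGate` class, the `slurm-bridge/enabled` label, and the `slurm-bridge-scheduler` name are hypothetical stand-ins chosen for illustration, and the quota check is reduced to a single CPU counter.

```python
from dataclasses import dataclass

# Hypothetical scheduler name; the real value depends on the Slurm Bridge deployment.
SLURM_SCHEDULER_NAME = "slurm-bridge-scheduler"
# Hypothetical opt-in label; the notes only say that namespaces opt in via label selectors.
OPT_IN_LABEL = ("slurm-bridge/enabled", "true")


@dataclass
class Job:
    name: str
    namespace: str
    cpu: int
    suspended: bool = True  # jobs stay suspended until the gate admits them
    scheduler_name: str = "default-scheduler"


@dataclass
class AdmissionGate:
    """Toy stand-in for an admission engine with a fixed CPU quota."""
    cpu_quota: int
    namespace_labels: dict  # namespace name -> {label key: label value}
    cpu_in_use: int = 0

    def opted_in(self, namespace: str) -> bool:
        key, value = OPT_IN_LABEL
        return self.namespace_labels.get(namespace, {}).get(key) == value

    def try_admit(self, job: Job) -> bool:
        # Only namespaces carrying the opt-in label are handled at all.
        if not self.opted_in(job.namespace):
            return False
        # Admission is a quota check only; placement is left to the scheduler.
        if self.cpu_in_use + job.cpu > self.cpu_quota:
            return False
        self.cpu_in_use += job.cpu
        job.suspended = False
        # Patch the admitted job so the secondary scheduler picks it up.
        job.scheduler_name = SLURM_SCHEDULER_NAME
        return True


gate = AdmissionGate(
    cpu_quota=8,
    namespace_labels={"hpc": {"slurm-bridge/enabled": "true"}, "web": {}},
)
jobs = [Job("a", "hpc", 4), Job("b", "hpc", 6), Job("c", "web", 1)]
admitted = [job.name for job in jobs if gate.try_admit(job)]
```

In this sketch, job `b` stays suspended because it would exceed the quota, and job `c` is ignored because its namespace has not opted in. The point of the split is that the gate limits how many objects become active at once, while the decision about where an admitted job runs is left entirely to the scheduler it was patched to use.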

### Kueue-Slurm Integration Challenges Discussion

Kevin explained that while Kueue can integrate with Slurm, there are challenges with workload reporting and admission control when using mutating webhooks. Abhishek asked about limitations with Kueue's features when used with Slurm, to which Kevin confirmed that advanced features like preemption and fair sharing wouldn't be effective since Slurm handles those separately. Shiny, a new participant from the National Research Council of Canada working on radio telescope infrastructure, introduced herself and asked about the setup requirements for Slurm Bridge, to which Marlow clarified that it doesn't need to run on every Slurm node and can be configured to work with external units.

### Slurm-Kubernetes Integration Discussion

The meeting focused on discussing the integration of Slurm with Kubernetes using tools like Kueue and Armada. Kevin explained the need for Kueue to manage job submission and prevent etcd from being overwhelmed, particularly in scenarios with mixed short- and long-lived tasks. Alex demonstrated Armada's multi-cluster capabilities, showing how it can manage both Slurm and Kubernetes clusters efficiently. The discussion also touched on the reasons organizations are moving from bare-metal Slurm installations to Kubernetes, including operational simplification and the need for a unified compute platform. Participants shared perspectives on GPU scheduling models in Slurm and Kubernetes, with Kevin noting differences between MIG and RunAI approaches. The conversation ended with a reminder about an upcoming session focused on benchmarking different workloads and batch schedulers.

- Alex Scammon (G-Research) [host]
- Abhishek Malvankar (Red Hat) [host]
- Marlow Warnicke (NVIDIA) [host]
- Mesut Oezdil (Open Source Contributor)

## 📋 Agenda

- 👋 Welcome & Introductions ("Hello, why are you here?")
- Kueue + Slurm?