WIP: Greene nightly multi-agent launcher + date-scoped wandb groups (do not merge) by eugenevinitsky · Pull Request #497 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-06-28T16:36:09Z

Summary

WIP — adds a Greene nightly launcher for the multi-agent (oignons2-derived) config, mirroring the existing single-agent launcher, plus date-scoped wandb grouping for both.

`scripts/cluster_configs/nightly_best.yaml` — multi-agent config (720k agents, 8 CARLA towns, 10B steps, gigaflow encoder + split network, perturbations on).
`scripts/launch_nightly_best.sh` — Greene launcher; mirrors `launch_single_agent.sh` but defaults `TIME=1800` / `MEM=192gb` to fit the heavier multi-agent profile.
Both launchers now append `wandb_group=${DATE_STAMP}` to `--args` so a night's 3 seeds cluster under one date-named wandb group.
Yamls now point at split wandb projects: `single_agent_speed_run.yaml -> nightly-single`, `nightly_best.yaml -> nightly-multi`.
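
To make the date-scoped grouping concrete, here is a minimal sketch of how one night's argument strings could be assembled. The helper name `nightly_args`, the `seed=` key, and the `YYYY-MM-DD` stamp format are assumptions for illustration; the PR only states that `wandb_group=${DATE_STAMP}` is appended to `--args`.

```python
from datetime import date
from typing import List, Optional, Sequence


def nightly_args(base_args: str,
                 seeds: Sequence[int] = (0, 1, 2),
                 day: Optional[date] = None) -> List[str]:
    """Return one argument string per seed, all sharing one date-named wandb group."""
    # Assumed DATE_STAMP format; the real launcher may format the date differently.
    stamp = (day or date.today()).strftime("%Y-%m-%d")
    # Hypothetical `seed=` key; only `wandb_group=` comes from the PR description.
    return [f"{base_args} seed={seed} wandb_group={stamp}".strip() for seed in seeds]


for line in nightly_args("", day=date(2026, 6, 28)):
    print(line)
```

Because every seed launched on the same night carries the same `wandb_group` value, the three runs land in one group inside whichever project the yaml selects.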

Mirror the existing single-agent nightly launcher with a multi-agent counterpart so both flavors can run on Greene with the same shape.

- scripts/cluster_configs/nightly_best.yaml: multi-agent (oignons2) config (720k agents, 8 maps, 10B steps, gigaflow encoder + split network).
- scripts/launch_nightly_best.sh: Greene launcher; same env-overridable shape as launch_single_agent.sh but defaults TIME=1800 / MEM=192gb to match the multi-agent profile.
- Both launchers add wandb_group=${DATE_STAMP} to --args so a night's 3 seeds cluster under one date-named wandb group.
- Yamls now point at split wandb projects:
    single_agent_speed_run.yaml -> nightly-single
    nightly_best.yaml -> nightly-multi
  Date-scoped grouping happens within each project.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Comments described the file as "derived from oignons2 weights/oignons2/config.yaml" — code-history detail that doesn't help a future reader and breaks once the source branch is gone. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vcharraut · 2026-06-29T09:23:19Z

+train.minibatch_size: 153600
+train.max_minibatch_size: 153600


imo the batch size can be simplified to a power of 2; with adv filtering and masking you never get a perfect bs anyway

So these are not the values you use? We're just trying to match the best run as closely as we can here
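
For reference, a small sketch of what "power of 2" would mean for the value in the diff. The reviewer did not name a target, so this only lists the two neighbouring powers of two around 153600; the helper name is hypothetical.

```python
from typing import Tuple


def neighbouring_powers_of_two(n: int) -> Tuple[int, int]:
    """Return the largest power of two <= n and the smallest power of two >= n."""
    if n < 1:
        raise ValueError("n must be positive")
    lower = 1 << (n.bit_length() - 1)
    upper = lower if lower == n else lower << 1
    return lower, upper


print(neighbouring_powers_of_two(153_600))  # (131072, 262144)
```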

vcharraut · 2026-06-29T09:23:48Z

+policy.input_size: 256
+policy.backbone_hidden_size: 1024
+policy.backbone_num_layers: 3
+policy.actor_hidden_size: 1024
+policy.actor_num_layers: 0
+policy.critic_hidden_size: 1024
+policy.critic_num_layers: 0
+policy.split_network: true
+policy.encoder_gigaflow: true
+policy.dropout: 0.0


this should be deprecated, especially the policy.encoder_gigaflow

Fixed in a9b5e29 (with the dead-key removal in b499850): policy.input_size, split_network, encoder_gigaflow, and dropout no longer exist in the [policy] schema and each crashed the launch at argparse time. split_network: true is translated to its successor shared_network: false; the other three have no equivalent and are dropped.

Generated by AI agent
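
The key translation described in this reply can be sketched as a small dictionary migration. This is an illustration only, not code from the repository: `migrate_policy` is a hypothetical helper, and it assumes the inversion also applies when `split_network` is false.

```python
# Keys the reply says have no equivalent in the current [policy] schema.
REMOVED_KEYS = {"input_size", "encoder_gigaflow", "dropout"}


def migrate_policy(policy: dict) -> dict:
    """Translate an old-style policy block into the newer key set."""
    migrated = {}
    for key, value in policy.items():
        if key == "split_network":
            # Successor key with inverted meaning: split=True -> shared=False.
            migrated["shared_network"] = not value
        elif key not in REMOVED_KEYS:
            migrated[key] = value
    return migrated


old_policy = {
    "input_size": 256,
    "backbone_hidden_size": 1024,
    "backbone_num_layers": 3,
    "split_network": True,
    "encoder_gigaflow": True,
    "dropout": 0.0,
}
print(migrate_policy(old_policy))
```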

vcharraut · 2026-06-29T09:24:47Z

+env.obs_slots_lane_n: 80
+env.obs_slots_boundary_n: 80


I think env.obs_slots_boundary_n: 80 can be reduced, since it is not similar to the gigaflow implementation

Sure! What value do you use in your best runs?

Heads-up: the yaml is now trimmed to only deltas from drive.ini (4b09564), and obs_slots_boundary_n: 80 matched the ini default, so the line is gone from this file. If it should be reduced, that's now either an explicit override here or a drive.ini default change — still curious what value your best runs use.

Generated by AI agent

vcharraut · 2026-06-29T09:25:22Z

+env.simulation_mode: gigaflow
+env.map_dir: pufferlib/resources/drive/binaries/carla
+env.num_maps: 8
+env.num_agents: 720000


this will break the memory i think?

You're right — the obs rollout buffer is num_agents × bptt(128) × obs_dim(828) float32 on the GPU, so 720k agents ≈ 305 GB against the launcher's single 141 GB H200 (bfloat16 is AMP-only and doesn't shrink it). Fixed in a9b5e29: reduced to 153600 (≈65 GB, and it matches minibatch_size so the auto batch splits into exactly 128 minibatches).

Generated by AI agent
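
The buffer arithmetic in this reply can be checked with a few lines. The shape (num_agents × bptt × obs_dim, float32) and the values 128 and 828 come from the reply itself; treating the auto batch as num_agents × bptt is an assumption that happens to reproduce the "128 minibatches" figure.

```python
def rollout_obs_bytes(num_agents: int,
                      bptt_horizon: int = 128,
                      obs_dim: int = 828,
                      bytes_per_elem: int = 4) -> int:
    """Size in bytes of a float32 observation rollout buffer."""
    return num_agents * bptt_horizon * obs_dim * bytes_per_elem


for agents in (720_000, 153_600):
    print(f"{agents:>7} agents -> {rollout_obs_bytes(agents) / 1e9:.1f} GB")
# 720000 agents -> 305.2 GB
# 153600 agents -> 65.1 GB

# Assumed definition of the auto batch: one bptt segment per agent.
batch_size = 153_600 * 128
print(batch_size // 153_600)  # 128 minibatches of size 153600
```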

Update: the override is dropped entirely in d59d532 — the config now inherits the drive.ini default num_agents rather than pinning a value here.

Generated by AI agent

policy.input_size, policy.split_network, policy.encoder_gigaflow, and policy.dropout no longer exist in drive.ini's [policy] schema, and the pufferl argparser is built strictly from ini keys, so each of these crashed the launch with 'unrecognized arguments'. split_network has a successor with inverted semantics (shared_network: false); the other three have no equivalent and are dropped. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
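
A minimal sketch of why a parser built strictly from ini keys rejects stale overrides. This is not pufferl's actual parser; the ini text, the `--section.key` flag naming, and `build_parser` are assumptions chosen to show the mechanism with the standard library.

```python
import argparse
import configparser

# Stand-in for a [policy] section that no longer contains encoder_gigaflow.
INI_TEXT = """
[policy]
backbone_hidden_size = 1024
shared_network = False
"""


def build_parser(ini_text: str) -> argparse.ArgumentParser:
    """Create one --section.key flag per ini entry and nothing else."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    parser = argparse.ArgumentParser(prog="train")
    for section in config.sections():
        for key, default in config[section].items():
            parser.add_argument(f"--{section}.{key}", default=default)
    return parser


parser = build_parser(INI_TEXT)
# parse_known_args collects leftovers instead of exiting, so the stale flag is visible.
known, unknown = parser.parse_known_args(
    ["--policy.shared_network", "False", "--policy.encoder_gigaflow", "True"]
)
print(unknown)
```

With `parse_args` instead of `parse_known_args`, argparse exits with an "unrecognized arguments" error for the same input, which is the failure mode the commit message describes.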

46 of the yaml's keys restated the ini defaults verbatim; the config already works as an override layer, so only the actual differences remain. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
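
Trimming an override file to its deltas amounts to a dictionary comparison against the defaults. The sketch below uses a hypothetical helper and partly hypothetical values: the thread confirms only that `obs_slots_boundary_n: 80` matched the ini default, and the `minibatch_size` default shown here is invented for the example.

```python
def trim_to_deltas(overrides: dict, defaults: dict) -> dict:
    """Keep only the override entries whose value differs from the default."""
    return {
        key: value
        for key, value in overrides.items()
        if key not in defaults or defaults[key] != value
    }


defaults = {
    "env.obs_slots_boundary_n": 80,   # matches the ini default per the review thread
    "train.minibatch_size": 32768,    # hypothetical default, for illustration only
}
overrides = {
    "env.obs_slots_boundary_n": 80,
    "train.minibatch_size": 153600,
}
print(trim_to_deltas(overrides, defaults))  # {'train.minibatch_size': 153600}
```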

The observation rollout buffer is num_agents x bptt_horizon x obs_dim float32 on the GPU. At 720000 agents that is ~305 GB against the launcher's single 141 GB H200 (bfloat16 precision is AMP-only and does not shrink it), so allocation fails at startup. 153600 agents puts the buffer at ~65 GB and matches minibatch_size so the auto batch splits into exactly 128 minibatches. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Fall back to the drive.ini default instead of pinning a value here. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

This was referenced Jun 28, 2026

WIP: Modal nightly cron — stacked on Greene PR (do not merge) #498

Closed

WIP: Modal nightly cron for PufferDrive training (do not merge) #496

Closed

vcharraut reviewed Jun 29, 2026


Eugene Vinitsky and others added 5 commits July 3, 2026 17:59

Trim nightly_best.yaml to deltas from drive.ini defaults

4b09564


Drop num_agents override from nightly_best.yaml

d59d532


Add best runs config

7f8813a

eugenevinitsky marked this pull request as ready for review July 3, 2026 22:21

eugenevinitsky wants to merge 7 commits into 3.0 from ev/nightly_runs_greene
