WIP: Greene nightly multi-agent launcher + date-scoped wandb groups (do not merge)#497
eugenevinitsky wants to merge 7 commits into
Mirror the existing single-agent nightly launcher with a multi-agent
counterpart so both flavors can run on Greene with the same shape.
- scripts/cluster_configs/nightly_best.yaml — multi-agent (oignons2)
config (720k agents, 8 maps, 10B steps, gigaflow encoder + split
network).
- scripts/launch_nightly_best.sh — Greene launcher; same env-overridable
shape as launch_single_agent.sh but defaults TIME=1800 / MEM=192gb to
match the multi-agent profile.
- Both launchers add wandb_group=${DATE_STAMP} to --args so a night's 3
seeds cluster under one date-named wandb group.
- Yamls now point at split wandb projects:
single_agent_speed_run.yaml -> nightly-single
nightly_best.yaml -> nightly-multi
Date-scoped grouping happens within each project.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
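As a rough illustration of the date-scoped grouping described above, the sketch below builds per-seed argument lists that all carry the same `wandb_group=<date>` entry. It is not the launcher itself (that is a shell script); the function name `nightly_args`, the `seed=` argument, and the `%Y-%m-%d` stamp format are assumptions made for the example, since the PR does not show how `DATE_STAMP` is formatted.

```python
from datetime import date


def nightly_args(seed: int, day: date) -> list[str]:
    """Build hypothetical per-seed launcher args that share one date-named group."""
    # Assumed stamp format; the real DATE_STAMP format is not shown in the PR.
    date_stamp = day.strftime("%Y-%m-%d")
    return [f"seed={seed}", f"wandb_group={date_stamp}"]


# Three seeds launched on the same night carry the same group argument.
runs = [nightly_args(seed, date(2025, 1, 31)) for seed in (0, 1, 2)]
groups = {arg for run in runs for arg in run if arg.startswith("wandb_group=")}
print(groups)
```

Because the group name depends only on the date, every seed from one night clusters together, while the split projects (nightly-single, nightly-multi) keep the two flavors apart.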
Comments described the file as "derived from oignons2 weights/oignons2/config.yaml" — code-history detail that doesn't help a future reader and breaks once the source branch is gone.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
train.minibatch_size: 153600
train.max_minibatch_size: 153600
imo the batch size can be simplified to a power of 2; with adv filtering and masking you never get a perfect batch size anyway
So these are not the values you use? We're just trying to match the best run as closely as we can here
policy.input_size: 256
policy.backbone_hidden_size: 1024
policy.backbone_num_layers: 3
policy.actor_hidden_size: 1024
policy.actor_num_layers: 0
policy.critic_hidden_size: 1024
policy.critic_num_layers: 0
policy.split_network: true
policy.encoder_gigaflow: true
policy.dropout: 0.0
this should be deprecated, especially the policy.encoder_gigaflow
Fixed in a9b5e29 (with the dead-key removal in b499850): policy.input_size, split_network, encoder_gigaflow, and dropout no longer exist in the [policy] schema and each crashed the launch at argparse time. split_network: true is translated to its successor shared_network: false; the other three have no equivalent and are dropped.
Generated by AI agent
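The key translation described in this reply can be sketched as a small migration function. This is only an illustration of the stated mapping (`split_network: true` becomes `shared_network: false`, and the three keys without a successor are dropped); the function name `migrate_policy_keys` and the flat dotted-key dict are assumptions for the example, not code from the repository.

```python
# Keys the reply says have no equivalent in the current [policy] schema.
REMOVED_KEYS = {"policy.input_size", "policy.encoder_gigaflow", "policy.dropout"}


def migrate_policy_keys(cfg: dict) -> dict:
    """Hypothetical helper: translate stale policy keys to the current schema."""
    out = {}
    for key, value in cfg.items():
        if key in REMOVED_KEYS:
            continue  # no successor, so the key is dropped
        if key == "policy.split_network":
            # The successor has inverted semantics.
            out["policy.shared_network"] = not value
        else:
            out[key] = value
    return out


old = {
    "policy.input_size": 256,
    "policy.backbone_hidden_size": 1024,
    "policy.split_network": True,
    "policy.encoder_gigaflow": True,
    "policy.dropout": 0.0,
}
new = migrate_policy_keys(old)
print(new)
```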
env.obs_slots_lane_n: 80
env.obs_slots_boundary_n: 80
I think env.obs_slots_boundary_n: 80 can be reduced, since it is not similar to the gigaflow implementation
Sure! What value do you use in your best runs?
Heads-up: the yaml is now trimmed to only deltas from drive.ini (4b09564), and obs_slots_boundary_n: 80 matched the ini default, so the line is gone from this file. If it should be reduced, that's now either an explicit override here or a drive.ini default change — still curious what value your best runs use.
Generated by AI agent
env.simulation_mode: gigaflow
env.map_dir: pufferlib/resources/drive/binaries/carla
env.num_maps: 8
env.num_agents: 720000
this will break the memory i think?
You're right — the obs rollout buffer is num_agents × bptt(128) × obs_dim(828) float32 on the GPU, so 720k agents ≈ 305 GB against the launcher's single 141 GB H200 (bfloat16 is AMP-only and doesn't shrink it). Fixed in a9b5e29: reduced to 153600 (≈65 GB, and it matches minibatch_size so the auto batch splits into exactly 128 minibatches).
Generated by AI agent
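The arithmetic in this reply can be checked with a few lines. The sketch uses the figures quoted above (bptt 128, obs_dim 828, float32 at 4 bytes) and reports decimal gigabytes; the function name `obs_buffer_bytes` is made up for the example, and the minibatch count assumes the auto batch size is num_agents × bptt, which the reply implies but does not state outright.

```python
def obs_buffer_bytes(num_agents: int, bptt_horizon: int = 128,
                     obs_dim: int = 828, bytes_per_elem: int = 4) -> int:
    """Size of a num_agents x bptt_horizon x obs_dim float32 buffer, in bytes."""
    return num_agents * bptt_horizon * obs_dim * bytes_per_elem


GB = 1e9  # decimal gigabytes, matching the rounded figures in the reply

for n in (720_000, 153_600):
    print(f"{n} agents: {obs_buffer_bytes(n) / GB:.1f} GB")

# Assumption: auto batch size = num_agents * bptt_horizon.
minibatch_size = 153_600
num_minibatches = (153_600 * 128) // minibatch_size
print("minibatches:", num_minibatches)
```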
Update: the override is dropped entirely in d59d532 — the config now inherits the drive.ini default num_agents rather than pinning a value here.
Generated by AI agent
policy.input_size, policy.split_network, policy.encoder_gigaflow, and policy.dropout no longer exist in drive.ini's [policy] schema, and the pufferl argparser is built strictly from ini keys, so each of these crashed the launch with 'unrecognized arguments'. split_network has a successor with inverted semantics (shared_network: false); the other three have no equivalent and are dropped.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
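To show why a stale key fails when the parser is built only from ini keys, here is a minimal standalone sketch using the standard-library `configparser` and `argparse`. It is not the pufferl implementation: the ini contents, the `--section.key` flag naming, and the helper `parser_from_ini` are assumptions for illustration. It uses `parse_known_args` so the unrecognized arguments can be inspected instead of terminating the process, as a plain `parse_args` call would.

```python
import argparse
import configparser

# Hypothetical stand-in for the [policy] section of an ini file.
INI_TEXT = """
[policy]
backbone_hidden_size = 1024
shared_network = false
"""


def parser_from_ini(ini_text: str) -> argparse.ArgumentParser:
    """Build a parser whose only flags are the keys present in the ini."""
    ini = configparser.ConfigParser()
    ini.read_string(ini_text)
    parser = argparse.ArgumentParser(allow_abbrev=False)
    for section in ini.sections():
        for key, default in ini[section].items():
            parser.add_argument(f"--{section}.{key}", default=default)
    return parser


parser = parser_from_ini(INI_TEXT)
known, unknown = parser.parse_known_args(
    ["--policy.shared_network", "false", "--policy.split_network", "true"]
)
print(vars(known))
print(unknown)  # the stale key and its value end up here
```

With `parse_args` in place of `parse_known_args`, the same input would exit with an "unrecognized arguments" error, which matches the failure mode described in the commit message.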
46 of the yaml's keys restated the ini defaults verbatim; the config already works as an override layer, so only the actual differences remain.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
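Trimming an override layer down to its deltas amounts to a dictionary comparison. The sketch below is an illustration under assumptions: `trim_to_deltas` is a made-up helper, and the default values are hypothetical except `env.obs_slots_boundary_n: 80`, which the thread says matched the drive.ini default.

```python
def trim_to_deltas(overrides: dict, defaults: dict) -> dict:
    """Keep only override entries that differ from (or are absent in) the defaults."""
    return {
        key: value
        for key, value in overrides.items()
        if key not in defaults or defaults[key] != value
    }


# Hypothetical defaults; only obs_slots_boundary_n = 80 is taken from the thread.
defaults = {"env.obs_slots_boundary_n": 80, "env.num_maps": 1}
overrides = {"env.obs_slots_boundary_n": 80, "env.num_maps": 8}

trimmed = trim_to_deltas(overrides, defaults)
print(trimmed)
```

An entry that restates a default is removed, so a later change to the default propagates to the nightly config instead of being silently pinned.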
The observation rollout buffer is num_agents x bptt_horizon x obs_dim float32 on the GPU. At 720000 agents that is ~305 GB against the launcher's single 141 GB H200 (bfloat16 precision is AMP-only and does not shrink it), so allocation fails at startup. 153600 agents puts the buffer at ~65 GB and matches minibatch_size so the auto batch splits into exactly 128 minibatches.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fall back to the drive.ini default instead of pinning a value here.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
WIP — adds a Greene nightly launcher for the multi-agent (oignons2-derived) config, mirroring the existing single-agent launcher, plus date-scoped wandb grouping for both.