WIP: Greene nightly multi-agent launcher + date-scoped wandb groups (do not merge)#497

Open
eugenevinitsky wants to merge 7 commits into 3.0 from ev/nightly_runs_greene

eugenevinitsky commented Jun 28, 2026

Summary

WIP — adds a Greene nightly launcher for the multi-agent (oignons2-derived) config, mirroring the existing single-agent launcher, plus date-scoped wandb grouping for both.

  • `scripts/cluster_configs/nightly_best.yaml` — multi-agent config (720k agents, 8 CARLA towns, 10B steps, gigaflow encoder + split network, perturbations on).
  • `scripts/launch_nightly_best.sh` — Greene launcher; mirrors `launch_single_agent.sh` but defaults `TIME=1800` / `MEM=192gb` to fit the heavier multi-agent profile.
  • Both launchers now append `wandb_group=${DATE_STAMP}` to `--args` so a night's 3 seeds cluster under one date-named wandb group.
  • Yamls now point at split wandb projects: `single_agent_speed_run.yaml -> nightly-single`, `nightly_best.yaml -> nightly-multi`.
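The date-scoped grouping can be sketched as follows. This is a minimal illustration in Python, not the launcher itself (the real launchers are shell scripts); the `seed=` key, the `%Y-%m-%d` date format, and the `nightly_args` helper are assumptions made for the example. Only `wandb_group=${DATE_STAMP}` comes from this PR.

```python
from datetime import date


def nightly_args(base_args: str, seed: int, date_stamp: str) -> str:
    """Build a space-separated key=value string ending in the wandb group.

    Hypothetical helper: the key names other than wandb_group are assumed.
    """
    parts = [base_args, f"seed={seed}", f"wandb_group={date_stamp}"]
    # Skip empty pieces so an empty base_args does not leave a leading space.
    return " ".join(p for p in parts if p)


# The date format is an assumption; the launchers define DATE_STAMP themselves.
date_stamp = date(2026, 7, 3).strftime("%Y-%m-%d")

# One night's three seeds all carry the same group name.
runs = [nightly_args("", seed, date_stamp) for seed in (0, 1, 2)]
groups = {r.split("wandb_group=")[1] for r in runs}
```

Because every seed launched on the same night shares one `wandb_group` value, the three runs collapse into a single group within each project.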

Mirror the existing single-agent nightly launcher with a multi-agent
counterpart so both flavors can run on Greene with the same shape.

- scripts/cluster_configs/nightly_best.yaml — multi-agent (oignons2)
  config (720k agents, 8 maps, 10B steps, gigaflow encoder + split
  network).
- scripts/launch_nightly_best.sh — Greene launcher; same env-overridable
  shape as launch_single_agent.sh but defaults TIME=1800 / MEM=192gb to
  match the multi-agent profile.
- Both launchers add wandb_group=${DATE_STAMP} to --args so a night's 3
  seeds cluster under one date-named wandb group.
- Yamls now point at split wandb projects:
    single_agent_speed_run.yaml -> nightly-single
    nightly_best.yaml           -> nightly-multi
  Date-scoped grouping happens within each project.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Comments described the file as "derived from oignons2 weights/oignons2/config.yaml" — code-history detail that doesn't help a future reader and breaks once the source branch is gone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment on lines +94 to +95
train.minibatch_size: 153600
train.max_minibatch_size: 153600

Collaborator

imo the batch size can be simplified to a power of 2; with adv filtering and masking you never get a perfect batch size anyway

Author

So these are not the values you use? We're just trying to match the best run as closely as we can here.

Comment on lines +80 to +89
policy.input_size: 256
policy.backbone_hidden_size: 1024
policy.backbone_num_layers: 3
policy.actor_hidden_size: 1024
policy.actor_num_layers: 0
policy.critic_hidden_size: 1024
policy.critic_num_layers: 0
policy.split_network: true
policy.encoder_gigaflow: true
policy.dropout: 0.0

Collaborator

These keys should be deprecated, especially policy.encoder_gigaflow.

Author

Fixed in a9b5e29 (with the dead-key removal in b499850): policy.input_size, split_network, encoder_gigaflow, and dropout no longer exist in the [policy] schema and each crashed the launch at argparse time. split_network: true is translated to its successor shared_network: false; the other three have no equivalent and are dropped.

Generated by AI agent
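As a rough sketch of the key migration described in the reply above: the three dead keys are dropped and `split_network` is rewritten to `shared_network` with its value inverted. The `migrate_policy_keys` helper and the flat dotted-key dict are assumptions for illustration; the actual fix was made by editing the yaml directly.

```python
# Keys the reply says no longer exist in the [policy] schema and have no successor.
DROPPED = {"policy.input_size", "policy.encoder_gigaflow", "policy.dropout"}


def migrate_policy_keys(overrides: dict) -> dict:
    """Return a copy of the overrides with stale policy keys removed or renamed.

    Hypothetical helper, written only to illustrate the mapping.
    """
    migrated = {}
    for key, value in overrides.items():
        if key in DROPPED:
            continue  # no equivalent in the current schema
        if key == "policy.split_network":
            # The successor has inverted semantics: split=True means shared=False.
            migrated["policy.shared_network"] = not value
            continue
        migrated[key] = value
    return migrated


old = {
    "policy.input_size": 256,
    "policy.backbone_hidden_size": 1024,
    "policy.split_network": True,
    "policy.encoder_gigaflow": True,
    "policy.dropout": 0.0,
}
new = migrate_policy_keys(old)
```

Here `new` keeps `policy.backbone_hidden_size` untouched and carries `policy.shared_network` set to `False` in place of `policy.split_network: true`.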

Comment on lines +38 to +39
env.obs_slots_lane_n: 80
env.obs_slots_boundary_n: 80

Collaborator

I think env.obs_slots_boundary_n: 80 can be reduced, since it does not match the gigaflow implementation.

Author

Sure! What value do you use in your best runs?

Author

Heads-up: the yaml is now trimmed to only deltas from drive.ini (4b09564), and obs_slots_boundary_n: 80 matched the ini default, so the line is gone from this file. If it should be reduced, that's now either an explicit override here or a drive.ini default change — still curious what value your best runs use.

Generated by AI agent
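For readers unfamiliar with the "only deltas from drive.ini" layout, the trimming can be approximated like this. It is a sketch under assumptions: the ini fragment is invented for the example (only `obs_slots_boundary_n = 80` is taken from this thread, and the `num_maps` default is hypothetical), and `deltas_from_defaults` is not a function in the repository.

```python
import configparser

# Illustrative stand-in for drive.ini; the num_maps default is hypothetical.
INI_TEXT = """
[env]
obs_slots_boundary_n = 80
num_maps = 1
"""


def deltas_from_defaults(overrides: dict, ini_text: str) -> dict:
    """Keep only overrides whose value differs from the ini default."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    kept = {}
    for dotted_key, value in overrides.items():
        section, _, option = dotted_key.partition(".")
        default = parser.get(section, option, fallback=None)
        # configparser returns strings, so compare string forms.
        # Caveat: booleans would need normalising ("True" vs "true").
        if default is None or default != str(value):
            kept[dotted_key] = value
    return kept


overrides = {"env.obs_slots_boundary_n": 80, "env.num_maps": 8}
trimmed = deltas_from_defaults(overrides, INI_TEXT)
```

With these inputs, `env.obs_slots_boundary_n` disappears from the override layer because it restates the default, which is why the line discussed above is no longer in the yaml.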

env.simulation_mode: gigaflow
env.map_dir: pufferlib/resources/drive/binaries/carla
env.num_maps: 8
env.num_agents: 720000

Collaborator

I think this will run out of memory?

Author

You're right — the obs rollout buffer is num_agents × bptt(128) × obs_dim(828) float32 on the GPU, so 720k agents ≈ 305 GB against the launcher's single 141 GB H200 (bfloat16 is AMP-only and doesn't shrink it). Fixed in a9b5e29: reduced to 153600 (≈65 GB, and it matches minibatch_size so the auto batch splits into exactly 128 minibatches).

Generated by AI agent
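The arithmetic behind those figures can be checked with a short calculation. The sketch below uses the numbers quoted in the reply (bptt 128, obs_dim 828, 4-byte float32) and decimal gigabytes; treating the auto batch as `num_agents * bptt_horizon` is an assumption consistent with the "exactly 128 minibatches" statement, not something read from the trainer code.

```python
def obs_buffer_bytes(num_agents: int, bptt_horizon: int = 128,
                     obs_dim: int = 828, bytes_per_elem: int = 4) -> int:
    """Size of a dense float32 observation rollout buffer in bytes."""
    return num_agents * bptt_horizon * obs_dim * bytes_per_elem


GB = 1e9  # decimal gigabytes, matching the rounded figures in the thread

big_gb = obs_buffer_bytes(720_000) / GB    # original num_agents
small_gb = obs_buffer_bytes(153_600) / GB  # reduced num_agents

# Assumed auto batch: one bptt_horizon-long segment per agent.
minibatch_size = 153_600
num_minibatches = (153_600 * 128) // minibatch_size

print(f"720000 agents: {big_gb:.1f} GB, 153600 agents: {small_gb:.1f} GB")
print(f"minibatches: {num_minibatches}")
```

The larger buffer alone comes to roughly 305 GB, more than twice the 141 GB of the H200, while the reduced setting fits at roughly 65 GB.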

Author

Update: the override is dropped entirely in d59d532 — the config now inherits the drive.ini default num_agents rather than pinning a value here.

Generated by AI agent

Eugene Vinitsky and others added 5 commits July 3, 2026 17:59
policy.input_size, policy.split_network, policy.encoder_gigaflow, and
policy.dropout no longer exist in drive.ini's [policy] schema, and the
pufferl argparser is built strictly from ini keys, so each of these
crashed the launch with 'unrecognized arguments'. split_network has a
successor with inverted semantics (shared_network: false); the other
three have no equivalent and are dropped.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

46 of the yaml's keys restated the ini defaults verbatim; the config
already works as an override layer, so only the actual differences
remain.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The observation rollout buffer is num_agents x bptt_horizon x obs_dim
float32 on the GPU. At 720000 agents that is ~305 GB against the
launcher's single 141 GB H200 (bfloat16 precision is AMP-only and does
not shrink it), so allocation fails at startup. 153600 agents puts the
buffer at ~65 GB and matches minibatch_size so the auto batch splits
into exactly 128 minibatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Fall back to the drive.ini default instead of pinning a value here.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@eugenevinitsky eugenevinitsky marked this pull request as ready for review July 3, 2026 22:21

2 participants