Allow and handle None rewards (miles side)#52

Draft
flukeskywalker wants to merge 1 commit into prod from training-semantics

@flukeskywalker flukeskywalker commented Jun 24, 2026

  • Treat Sample.remove_sample=True as a training-semantic removal, not only a loss-mask change.
  • Exclude removed samples from default GRPO/GSPO/reinforce++-baseline reward normalization and keep processed rewards/advantages/returns neutral.
  • Preserve shape in rollout/train artifacts with explicit remove_samples metadata and allow zero loss masks only for explicitly removed samples.
  • Update --rollout-sample-filter-path help/docs and add focused tests for normalization and zero-mask validation.
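
As a rough illustration of the semantics in the bullets above, here is a minimal sketch and not the code in this PR. The `Sample` dataclass, `normalize_group_rewards`, and `validate_loss_mask` are hypothetical stand-ins, and using the population standard deviation with a small `eps` is an assumption about the normalization. It shows removed samples being left out of the group statistics while the output keeps one entry per sample.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import List, Optional


@dataclass
class Sample:
    """Hypothetical stand-in for the real Sample type; only two fields are modeled."""
    reward: Optional[float]
    remove_sample: bool = False


def normalize_group_rewards(samples: List[Sample], eps: float = 1e-6) -> List[float]:
    """Group-normalize rewards, ignoring removed samples in the statistics.

    Removed samples get a neutral 0.0 so the output has one entry per input.
    """
    kept = [s.reward for s in samples if not s.remove_sample]
    if any(r is None for r in kept):
        raise ValueError("reward=None is only allowed when remove_sample=True")
    if not kept:
        # Every sample in the group was removed: nothing to normalize.
        return [0.0] * len(samples)
    mean = fmean(kept)
    std = pstdev(kept)  # assumption: population std, as in a plain GRPO-style baseline
    return [
        0.0 if s.remove_sample else (s.reward - mean) / (std + eps)
        for s in samples
    ]


def validate_loss_mask(sample: Sample, loss_mask: List[int]) -> None:
    """Reject an all-zero loss mask unless the sample was explicitly removed."""
    if not any(loss_mask) and not sample.remove_sample:
        raise ValueError("all-zero loss mask on a sample that was not removed")


# One group of three rollouts; the middle one has no verifier reward.
group = [Sample(1.0), Sample(None, remove_sample=True), Sample(0.0)]
advantages = normalize_group_rewards(group)
```

In this sketch the removed sample neither shifts the group mean nor changes the standard deviation, and its slot in `advantages` stays at 0.0, so downstream shapes are unchanged.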

Why

With https://github.com/LLM360/RL360/pull/427, RL360 will mark no-verifier/no-reward Harbor samples with remove_sample=True. Those samples will have reward=None once https://github.com/LLM360/RL360/pull/415 is merged. Miles needs that marker to mean the sample is excluded from the loss and from the training math, so that it no longer influences group reward baselines or advantage normalization.

TODO: make excluding such samples from reward normalization optional, with the option defaulting to False to preserve current behavior.

@flukeskywalker flukeskywalker changed the base branch from main to prod June 24, 2026 08:01
@flukeskywalker flukeskywalker changed the title [codex] Implement removed sample training semantics Allow and handle None rewards (miles side) Jun 24, 2026
