eth-sri/saturated-learning

Saturated Learning

Training and rerunning saturated learning experiments.

Python: >=3.10 | Managed with uv | License: MIT

👋 Overview

This repository contains the scripts needed to score generated traces, convert scores into preference data, and run DPO or RRHF-style training experiments.

The minimal workflow is:

  1. Score traces with entropy and self-judging signals.
  2. Convert scored traces into training datasets.
  3. Train adapters from the resulting preference or ranked data.
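As a rough illustration of what an entropy signal in step 1 could look like, the sketch below averages the Shannon entropy of per-token next-token distributions over a trace. This is an assumption made for illustration only: the function names are hypothetical, and score_entropy.py may compute a different quantity.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_trace_entropy(per_token_probs):
    """Average the per-token entropies over a whole trace."""
    if not per_token_probs:
        raise ValueError("trace has no tokens")
    total = sum(token_entropy(p) for p in per_token_probs)
    return total / len(per_token_probs)


# Toy trace (hypothetical data): one fully confident step followed by
# one uniform, maximally uncertain step over four tokens.
trace = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
print(mean_trace_entropy(trace))
```

A lower value means the model was more confident along the trace, which is presumably why a later step works with an inverse-entropy score.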

🚀 Setup

uv sync

Create a .env file with your Hugging Face namespace and token and your Weights & Biases API key:

HF_NAME=your-huggingface-name
HF_TOKEN=your-huggingface-token
WANDB_API_KEY=your-wandb-api-key
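Before launching a long run, it can help to confirm that all three variables are present. The following is a minimal standard-library sketch, not part of the repository; parse_env and missing_keys are hypothetical helper names, and the parser only handles plain KEY=VALUE lines.

```python
from pathlib import Path

# The three keys listed in the .env example above.
REQUIRED_KEYS = ("HF_NAME", "HF_TOKEN", "WANDB_API_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("missing:", missing_keys(values))
```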

🧪 Minimal Experiment

The example below assumes that HF_NAME is set in your shell to your Hugging Face namespace and that the source dataset ($HF_NAME/chainsum_generations) has question, generations, and correct_mask columns.
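A quick sanity check on one row can catch a malformed source dataset early. The sketch below is illustrative only: it assumes generations is a list of strings and correct_mask is a list with one flag per generation, which the README does not state explicitly, and check_row is a hypothetical helper.

```python
REQUIRED_COLUMNS = ("question", "generations", "correct_mask")


def check_row(row):
    """Return a list of problems found in one dataset row (empty if none)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]
    if problems:
        return problems
    # Assumption: one correctness flag per generation.
    if len(row["generations"]) != len(row["correct_mask"]):
        problems.append("generations and correct_mask differ in length")
    return problems


# Hypothetical example row; real rows come from the source dataset.
row = {
    "question": "2 + 3 + 4 = ?",
    "generations": ["The sum is 9.", "The sum is 10."],
    "correct_mask": [True, False],
}
print(check_row(row))
```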

export MODEL=Qwen/Qwen3-1.7B-Base
export SRC_GENERATIONS=$HF_NAME/chainsum_generations
export TAG=chainsum_minimal_s42
export MAX_LEN=2048
export JUDGE_MAX_LEN=8192

# 1. Score the generated traces.
uv run python score_entropy.py \
  --model "$MODEL" \
  --dataset "$SRC_GENERATIONS" \
  --output "$HF_NAME/${TAG}_sw_entropy" \
  --max_seq_length "$MAX_LEN"

uv run python score_judge.py \
  --judge_model "$MODEL" \
  --dataset "$SRC_GENERATIONS" \
  --output "$HF_NAME/${TAG}_sw" \
  --max_model_len "$JUDGE_MAX_LEN" \
  --tensor_parallel_size 1

# 2. Convert scores into training data.
uv run python make_data.py inv_entropy \
  --src "$HF_NAME/${TAG}_sw_entropy" \
  --out "$HF_NAME/${TAG}_inv_entropy"

uv run python make_data.py dpo_pairs \
  --src "$HF_NAME/${TAG}_inv_entropy" \
  --out "$HF_NAME/${TAG}_entropy_dpo"

uv run python make_data.py dpo_pairs \
  --src "$HF_NAME/${TAG}_sw" \
  --out "$HF_NAME/${TAG}_sjudge_dpo"

# 3. Run a training experiment.
uv run python -m trl.scripts.dpo \
  --model_name_or_path "$MODEL" \
  --dataset_name "$HF_NAME/${TAG}_sjudge_dpo" \
  --output_dir "outputs/${TAG}_sjudge_DPO" \
  --learning_rate 5e-5 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --gradient_checkpointing \
  --use_peft \
  --lora_r 32 \
  --lora_alpha 64 \
  --lora_target_modules q_proj k_proj v_proj o_proj \
  --max_length "$MAX_LEN" \
  --beta 5.0 \
  --loss_type sigmoid

uv run python train_rrhf.py \
  --model "$MODEL" \
  --dataset "$HF_NAME/${TAG}_sw" \
  --output_dir "outputs/${TAG}_sjudge_RRHF-BT" \
  --bt_reweight \
  --rank_weight 0.1 \
  --per_device_train_batch_size 16 \
  --lora_r 32 \
  --lora_alpha 64 \
  --max_seq_length "$MAX_LEN"
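To make the conversion in step 2 more concrete, here is one plausible way to turn per-generation scores into a single preference pair: take the highest-scored generation as chosen and the lowest-scored as rejected. This is a sketch under stated assumptions, not the logic of make_data.py; best_worst_pair and the scores argument are hypothetical. The prompt, chosen, and rejected field names follow the preference format that TRL's DPO trainer accepts.

```python
def best_worst_pair(question, generations, scores):
    """Build one preference record from scored generations.

    Returns None when no informative pair exists (fewer than two
    generations, or all scores tied).
    """
    if len(generations) != len(scores):
        raise ValueError("generations and scores must have the same length")
    if len(generations) < 2:
        return None
    # Sort by score only, so ties never fall back to comparing text.
    ranked = sorted(zip(scores, generations), key=lambda pair: pair[0])
    low_score, rejected = ranked[0]
    high_score, chosen = ranked[-1]
    if high_score == low_score:
        return None
    return {"prompt": question, "chosen": chosen, "rejected": rejected}


# Hypothetical scores, e.g. inverse entropy or a self-judge rating.
pair = best_worst_pair(
    "2 + 3 + 4 = ?",
    ["The sum is 9.", "The sum is 10.", "I think it is 9."],
    [0.9, 0.2, 0.6],
)
print(pair)
```

Picking only the extremes keeps one pair per question; other schemes, such as all pairs above a score margin, trade dataset size against label noise.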

🏃 Convenience Script

For the full default run, call run.sh with a GPU index or a comma-separated list of GPU indices as the first argument. The optional second argument is the seed, which defaults to 42.

bash run.sh 0 42

Outputs are written under outputs/, evaluation summaries under output/eval/, and command logs under logs/.

About

Learning from Saturated Data: experiment scripts with scoring, data generation and training.
