conda create -n agent python=3.10 -y
conda activate agent
# Install dependencies
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
pip3 install -r requirements.txt
# Add verl as a submodule
git submodule add https://github.com/volcengine/verl.git verl
git submodule update --init --recursive --remote
cd verl
pip3 install -e .
cd ..
- Sokoban
- Search
- Zebra Puzzle
- Math
- Code
- Clear support for custom environments
- Easy integration with verl via submodule
- Rephrased environment feedback
- Combined SFT and RL loss
- Monte Carlo Tree Search (MCTS)
- Create a new folder under src/env/ (e.g., myenv/) and implement env.py following the examples in other environments.
- Your environment class should implement at least:
  - run(responses_str, batch, chat_template): main agent-environment interaction.
  - get_reward_allocation(reward_tensors): (optional) custom reward allocation.
- Register your environment in src/env/__init__.py and add a branch in the get_env function.
- Specify your environment in training scripts with +env.name=your_env.
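The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual interface: only the method names and their parameters come from the list above. The class name `MyEnv`, the toy scoring rule, the dictionary returned by `run`, the use of plain lists instead of tensors, and the single-argument form of `get_env` are all assumptions made to keep the example self-contained.

```python
from typing import Any, Dict, List, Optional


class MyEnv:
    """Hypothetical environment that would live in src/env/myenv/env.py."""

    def __init__(self, target: str = "42") -> None:
        # Toy task (assumption): a response is correct if it mentions `target`.
        self.target = target

    def run(
        self,
        responses_str: List[str],
        batch: Optional[Any] = None,
        chat_template: Optional[Any] = None,
    ) -> Dict[str, List[Any]]:
        """Main agent-environment interaction.

        The return structure is an assumption made for this sketch.
        """
        rewards = [1.0 if self.target in r else 0.0 for r in responses_str]
        feedback = ["correct" if r > 0 else "try again" for r in rewards]
        return {"feedback": feedback, "rewards": rewards}

    def get_reward_allocation(self, reward_tensors: List[float]) -> List[float]:
        """Optional hook; here the rewards pass through unchanged."""
        return list(reward_tensors)


def get_env(name: str) -> MyEnv:
    """Stand-in for the get_env function in src/env/__init__.py."""
    if name == "myenv":  # the new branch for the custom environment
        return MyEnv()
    raise ValueError(f"Unknown environment: {name}")


env = get_env("myenv")
result = env.run(["The answer is 42.", "I am not sure."])
print(result["rewards"])  # [1.0, 0.0]
```

In the real code base, check an existing environment's env.py for the actual return types and for how `batch` and `chat_template` are used before copying this shape.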
Each environment provides a create_dataset.py script for generating training and test datasets. For example:
cd src/env/sokoban
python create_dataset.py --output dataset/sokoban --train_size 10000 --test_size 100
Other environments (math, zebra, code) are similar. See each script for available arguments.
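For a custom environment, a dataset script that accepts the same three flags might look like the sketch below. Only the flag names `--output`, `--train_size`, and `--test_size` come from the command above. The addition task, the `prompt`/`answer` fields, the JSONL output format, and the file names are placeholders chosen for illustration, and the existing scripts may store data differently.

```python
import argparse
import json
import os
import random


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate train/test splits.")
    parser.add_argument("--output", required=True, help="output directory")
    parser.add_argument("--train_size", type=int, default=10000)
    parser.add_argument("--test_size", type=int, default=100)
    return parser.parse_args(argv)


def make_example(rng):
    # Placeholder task (assumption): two-number addition.
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    return {"prompt": f"What is {a} + {b}?", "answer": str(a + b)}


def write_split(path, size, rng):
    # One JSON object per line; the real scripts may use another format.
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(size):
            f.write(json.dumps(make_example(rng)) + "\n")


def main(argv=None):
    args = parse_args(argv)
    os.makedirs(args.output, exist_ok=True)
    rng = random.Random(0)  # fixed seed so the splits are reproducible
    write_split(os.path.join(args.output, "train.jsonl"), args.train_size, rng)
    write_split(os.path.join(args.output, "test.jsonl"), args.test_size, rng)
```

Calling `main()` with no arguments reads the flags from the command line, so a script built this way would be invoked in the same manner as the Sokoban example.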
For example, to train on the math environment (see train_math_ppo.sh for details):
bash train_math_ppo.sh
Or run the main program directly:
python -m src.core.main_ppo +env.name=math ...
See the training scripts for all configurable parameters.