Jiangshan's Personal Website

Posted 2026-06-10 | Updated 2026-07-01 | Study Experiment | 16 minutes read (About 2394 words)

PPO vs GRPO — Post-Training Qwen2.5-0.5B-Instruct on GSM8K with veRL

This post compares PPO and GRPO for post-training Qwen2.5-0.5B-Instruct on GSM8K with veRL. It focuses on a small-model, verifiable-reward setting in which the reward is rule-based correctness rather than the output of a learned reward model.
