PPO vs GRPO — Post-Training Qwen2.5-0.5B-Instruct on GSM8K with veRL
This post compares PPO and GRPO for post-training Qwen2.5-0.5B-Instruct on GSM8K with veRL. It focuses on a small-model, verifiable-reward setting in which the reward is rule-based answer correctness rather than the output of a learned reward model.
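To make "rule-based correctness" concrete, here is a minimal sketch of such a reward in Python. It assumes the model is prompted to finish with a `#### <number>` line, mirroring the format of GSM8K reference solutions. The names `extract_final_answer` and `correctness_reward` are hypothetical and are not veRL's own reward implementation.

```python
import re
from typing import Optional

# Matches an optionally negative number with optional thousands separators
# and an optional decimal part, e.g. "72", "-3", "1,000", "4.5".
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the first number after the last '####' marker, or None.

    Assumption: the final answer follows a '####' marker, as in GSM8K
    reference solutions. Commas are stripped so "1,000" becomes "1000".
    """
    _, marker, tail = text.rpartition("####")
    if not marker:
        return None  # no marker, so no parseable answer
    match = _NUMBER.search(tail)
    if match is None:
        return None
    return match.group(0).replace(",", "")


def correctness_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0.

    The comparison is an exact string match after normalization, so "72.0"
    and "72" are treated as different answers in this sketch.
    """
    predicted = extract_final_answer(response)
    if predicted is not None and predicted == ground_truth:
        return 1.0
    return 0.0
```

Because the reward is computed by a rule like this, no reward model needs to be trained or served during rollouts. A stricter or more lenient matcher (for example, numeric rather than string comparison) is a design choice that changes how often partially formatted answers are rewarded.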