Posted 2026-06-29 · Updated 2026-07-01 · Study Experiment · 6 minutes read (About 918 words)
2026-06-29-SlimeSearchR1Example
Training and evaluation of the slime Search-R1-trained Qwen2.5-3B model on NQ and HotpotQA.

Posted 2026-06-10 · Updated 2026-07-01 · Study Experiment · 16 minutes read (About 2394 words)
PPO vs GRPO — Post-Training Qwen2.5-0.5B-Instruct on GSM8K with veRL
Comparing PPO and GRPO for post-training Qwen2.5-0.5B-Instruct on GSM8K with veRL, focusing on a small-model, verifiable-reward setting where the reward is rule-based correctness rather than a learned reward model.

Posted 2026-06-03 · Updated 2026-07-01 · Study Experiment · 13 minutes read (About 1982 words)
2026-06-03-SwingUpCartpole
From CartPole-v1 to CartPole Swing-Up