Category: Study Experiment - Jiangshan's Personal Website

Posted 2026-06-29 | Updated 2026-07-01 | Study Experiment | 6 minutes read (about 918 words)

2026-06-29-SlimeSearchR1Example

Training and evaluation of a Qwen2.5-3B model trained with Search-R1 in slime, on NQ and HotpotQA.

Posted 2026-06-10 | Updated 2026-07-01 | Study Experiment | 16 minutes read (about 2394 words)

PPO vs GRPO — Post-Training Qwen2.5-0.5B-Instruct on GSM8K with veRL

Comparing PPO and GRPO for post-training Qwen2.5-0.5B-Instruct on GSM8K with veRL, focusing on a small-model, verifiable-reward setting where the reward is rule-based correctness rather than a learned reward model.

Posted 2026-06-03 | Updated 2026-07-01 | Study Experiment | 13 minutes read (about 1982 words)

2026-06-03-SwingUpCartpole

From CartPole-v1 to CartPole Swing-Up