2026-06-29-SlimeSearchR1Example
Search-R1 on slime: Training a Qwen2.5-3B Search Agent
This post walks through a full Search-R1 run on slime: RL-training a Qwen2.5-3B base model to use a retrieval tool on open-domain QA, then evaluating it against the pre-RL base on the NQ and HotpotQA test sets.
The short version: 3000 GRPO steps roughly triple NQ exact-match and lift HotpotQA by ~3.4×, and the qualitative logs show that RL teaches the model to search and ground its answers instead of guessing from memory.
The task
Search-R1 frames open-domain QA as a multi-turn agentic loop. The model reasons, optionally emits a search query, reads the retrieved passages, and commits to a final answer, all within one structured protocol:
<think> … </think>
slime runs this through a custom generate function (generate_with_search.generate) that interleaves sglang generation with live retrieval calls. Per the config in that file, each rollout allows at most 2 search turns, and each turn returns the top-3 documents from a local dense retriever. The reward is computed by generate_with_search.reward_func, which is pure exact match of the extracted <answer> against the gold answers.
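The interleaving described above can be pictured with a minimal sketch. This is not the code in generate_with_search; `rollout`, `fake_generate`, and `fake_retrieve` are hypothetical names, and the assumption that generation stops right after a closing `</search>` or `</answer>` tag is mine. Only the two limits (2 search turns, top-3 docs) come from the post.

```python
import re
from typing import Callable, List

MAX_SEARCH_TURNS = 2  # from the post: at most 2 search turns per rollout
TOP_K = 3             # from the post: top-3 docs per turn

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def rollout(prompt: str,
            generate: Callable[[str], str],
            retrieve: Callable[[str, int], List[str]]) -> str:
    """Alternate generation and retrieval; return the response without the prompt."""
    context = prompt
    searches_used = 0
    while True:
        # Assumption: generate() stops after </search> or </answer>.
        segment = generate(context)
        context += segment
        match = SEARCH_RE.search(segment)
        # Stop when the model did not search, or the search budget is spent.
        if match is None or searches_used >= MAX_SEARCH_TURNS:
            return context[len(prompt):]
        searches_used += 1
        docs = retrieve(match.group(1).strip(), TOP_K)
        context += "<information>" + "\n".join(docs) + "</information>"


# Hypothetical stand-ins for the sglang engine and the dense retriever.
def fake_generate(context: str) -> str:
    if "<information>" in context:
        return "<answer>Ludacris</answer>"
    return "<think>I should look this up.</think><search>who raps in baby</search>"


def fake_retrieve(query: str, k: int) -> List[str]:
    return [f"doc {i} for {query}" for i in range(k)]


response = rollout("Question: who sings the rap in baby?\n", fake_generate, fake_retrieve)
print(response)
```

With the stubs, the model searches once, receives three passages inside `<information>`, and then answers, which is the search → read → answer shape the rest of the post evaluates.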
Training process
Recipe at a glance
| Component | Choice |
|---|---|
| Algorithm | GRPO (--advantage-estimator grpo) |
| Base model | Qwen2.5-3B (also the KL ref) |
| Reward | Exact match |
| Training data | nq_hotpotqa_train/train.parquet (mixed NQ + HotpotQA) |
| Rollout steps | 3000 (--num-rollout 3000) |
| Prompts / step | 32 (--rollout-batch-size 32) |
| Samples / prompt | 8 (--n-samples-per-prompt 8) → 256 trajectories/step |
| Global batch size | 256 |
| Max response length | 512 tokens (--rollout-max-response-len 512) |
| Rollout temperature | 1.0 (exploration during training) |
| Optimizer | Adam, lr 1e-6 constant, wd 0.01, β=(0.9, 0.98) |
| KL | low-var KL to ref, coef 1e-3 |
| Entropy bonus | 0.0 |
| PPO clip | 0.2 / 0.28 high (clip-higher) |
| TIS | on |
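To make the batch arithmetic and the GRPO row concrete, here is a small sketch. The step, prompt, and sample counts come from the table; the advantage formula is the commonly described group-normalized form, and the choice of population standard deviation and the `eps` value are assumptions rather than slime's exact implementation.

```python
from statistics import mean, pstdev

PROMPTS_PER_STEP = 32     # --rollout-batch-size
SAMPLES_PER_PROMPT = 8    # --n-samples-per-prompt
NUM_ROLLOUT_STEPS = 3000  # --num-rollout

trajectories_per_step = PROMPTS_PER_STEP * SAMPLES_PER_PROMPT   # 256, matches the global batch size
total_trajectories = trajectories_per_step * NUM_ROLLOUT_STEPS  # 768000 over the full run


def grpo_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, 8 samples, exact-match rewards: two correct, six wrong.
group = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(grpo_advantages(group))      # correct samples about +1.73, wrong ones about -0.58

# A group where every sample is wrong carries no learning signal.
print(grpo_advantages([0.0] * 8))  # all zeros
```

With a binary reward, only groups that mix correct and incorrect samples produce non-zero advantages, which is one reason sampling 8 trajectories per prompt at temperature 1.0 matters early in training.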
Training logs
[Figures: reward curve; response length]
The model learns to emit effective responses while the response length stays stable.
Evaluation
Eval re-uses the exact training generation/reward path with greedy decoding (--eval-temperature 0) and 1 sample per question. The merged test.parquet is split by data_source into:
- NQ: nq_test.parquet, 3,610 questions
- HotpotQA: hotpotqa_test.parquet, 7,405 questions
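The per-dataset split amounts to grouping rows by their data_source value. A minimal sketch, assuming the rows have already been loaded as dicts (for example with `pandas.read_parquet(...).to_dict("records")`); the `em` field and the source labels `nq` and `hotpotqa` are hypothetical placeholders, not names confirmed by the dataset:

```python
from collections import defaultdict


def mean_em_by_source(rows):
    """Average a per-question EM score separately for each data_source."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["data_source"]].append(row["em"])
    return {source: sum(vals) / len(vals) for source, vals in scores.items()}


# Toy rows standing in for the merged test set.
rows = [
    {"data_source": "nq", "em": 1.0},
    {"data_source": "nq", "em": 0.0},
    {"data_source": "hotpotqa", "em": 1.0},
]
print(mean_em_by_source(rows))  # {'nq': 0.5, 'hotpotqa': 1.0}
```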
Headline
| Model | NQ EM | HotpotQA EM |
|---|---|---|
| Qwen2.5-3B base | 0.1620 | 0.0984 |
| Qwen2.5-3B Search-R1 RL for 3000 steps | 0.4576 | 0.3381 |
| Δ (RL gain) | +0.2956 | +0.2397 |
RL training raises NQ EM by about 2.8× (roughly triple) and HotpotQA EM by about 3.4×.
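The deltas and ratios follow directly from the table; a quick check using only the four reported EM values:

```python
base = {"NQ": 0.1620, "HotpotQA": 0.0984}
trained = {"NQ": 0.4576, "HotpotQA": 0.3381}

for name in base:
    delta = trained[name] - base[name]
    ratio = trained[name] / base[name]
    print(f"{name}: delta=+{delta:.4f}, ratio={ratio:.2f}x")

# NQ: delta=+0.2956, ratio=2.82x
# HotpotQA: delta=+0.2397, ratio=3.44x
```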
Metric definition
The score is Exact Match (EM) as implemented in qa_em_format.py → compute_score_em:
- Extract the model’s final <answer>…</answer>.
- Normalize (lowercase, strip punctuation, drop articles a/an/the, collapse whitespace, as in SQuAD/DPR normalization).
- Score = 1.0 if it exactly equals any gold answer, else 0.0.
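The three steps above can be sketched as follows. This is an illustrative re-implementation in the SQuAD normalization style, not the code in qa_em_format.py; the function names and the choice to take the last `<answer>` span are assumptions.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(response: str):
    """Return the content of the last <answer>…</answer> span, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return matches[-1].strip() if matches else None


def exact_match(response: str, golds) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = extract_answer(response)
    if pred is None:
        return 0.0
    norm = normalize_answer(pred)
    return 1.0 if any(norm == normalize_answer(g) for g in golds) else 0.0


print(exact_match("<answer> The Ludacris. </answer>", ["Ludacris"]))  # 1.0
print(exact_match("<answer>Drake</answer>", ["Ludacris"]))            # 0.0
# A response that only echoes the instructions yields a stray "and".
print(extract_answer("I will answer between <answer> and </answer>."))  # and
```

Because the score is all-or-nothing on the extracted span, a response that never emits a well-formed `<answer>` tag scores 0.0 no matter what the rest of the text says.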
Qualitative comparison
Trained model — uses the search tool, grounds the answer
Question: “who sings the rap in baby by justin bieber?” (gold: Ludacris)
<think><search>who sings the rap in baby by justin bieber</search>
Extracted “Ludacris” = gold. EM = 1. A complete search → read → answer trajectory in ~40 tokens.
Base model — two typical failure modes
Mode 1 — answers from memory, never searches (often wrong):
Question: “who sings song at the end of beauty and the beast?” (gold: Josh Groban)
<|im_start|>assistant I found the answer to your question! The song at the end of
No <search>. Confidently wrong. EM = 0.
Mode 2 — doesn’t answer, echoes the instructions (degenerate):
Question: (gold: Kentucky)
<|im_start|>assistant I will conduct reasoning inside <think> and </think> first.
Never searches, never answers; scorer extracts a stray “and”. EM = 0.
What the contrast shows
| | Base | Trained |
|---|---|---|
| Uses <search> tool | Rarely / never | Yes, reflexively |
| Grounds answer in retrieved docs | No | Yes |
| Output style | Rambles / echoes prompt | Concise: search → read → answer (~37 tokens) |
| Typical result | Memorized guess or malformed | Retrieved, correct |
The lesson the EM reward actually taught is the agentic behavior: on a factual question, issue a search, read the passages, and extract the answer in the <think>/<search>/<information>/<answer> protocol. The base model has the same underlying knowledge but doesn’t invoke the tool or follow the format, so it guesses from memory (Mode 1) or parrots the prompt (Mode 2).
Also note that </think> is absent from the trained model’s answer. Because the reward is pure exact match, the policy learned during training that reasoning tokens and the </think> closing tag earn nothing. To make the output more format-valid, we should give some weight to structure_format_score.
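One way such a format weight could enter the reward is sketched below. This is a hypothetical shaping rule, not the behavior of structure_format_score in qa_em_format.py: the `is_well_formed` check, the `shaped_reward` function, and the 0.2 weight are all illustrative assumptions.

```python
def is_well_formed(response: str) -> bool:
    """Hypothetical check: each protocol tag is closed as often as it is opened."""
    for tag in ("think", "search", "answer"):
        if response.count(f"<{tag}>") != response.count(f"</{tag}>"):
            return False
    return "<answer>" in response


def shaped_reward(em: float, response: str, structure_format_score: float = 0.2) -> float:
    """Blend exact match with a small format term (illustrative weights)."""
    if is_well_formed(response):
        # Valid format: full credit if correct, partial credit otherwise.
        return em if em == 1.0 else structure_format_score
    # Invalid format: correct answers lose the format share, wrong ones get nothing.
    return max(em - structure_format_score, 0.0)


valid = "<think>check</think><search>q</search><answer>Ludacris</answer>"
unclosed = "<think><search>q</search><answer>Ludacris</answer>"  # missing </think>

print(shaped_reward(1.0, valid))     # 1.0
print(shaped_reward(1.0, unclosed))  # 0.8
print(shaped_reward(0.0, valid))     # 0.2
```

Under a rule like this, dropping `</think>` costs a correct trajectory part of its reward, so the policy would have a reason to keep the tag.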
Reproduce
Follow the GitHub repo to reproduce the experiment. I set up the local retriever on the host in a conda environment and ran the slime Search-R1 training script in the slimerl:slime Docker image.
http://gong208.github.io/2026/06/29/2026-06-29-SlimeSearchR1Example/

