Posted 2026-06-29 · Updated 2026-07-01 · Study Experiment · 6 minutes read (About 918 words)

2026-06-29-SlimeSearchR1Example

Search-R1 on slime: Training a Qwen2.5-3B Search Agent

This post walks through a full Search-R1 run on slime: RL-training a Qwen2.5-3B base model to use a retrieval tool on open-domain QA, then evaluating it against the pre-RL base on the NQ and HotpotQA test sets.
The short version: 3000 GRPO steps roughly triple NQ exact match (~2.8×) and lift HotpotQA by ~3.4×, and the qualitative logs show that RL teaches the model to search and ground its answers instead of guessing from memory.

The task

Search-R1 frames open-domain QA as a multi-turn agentic loop. The model reasons, optionally emits a search query, reads the retrieved passages, and commits to a final answer, all within one structured protocol:

<think> … </think>
<search> query </search>            # optional, can repeat
<information> Doc 1 … Doc 2 … Doc 3 </information>   # injected by the env
<answer> final answer </answer>

slime runs this through a custom generate function (generate_with_search.generate) that interleaves sglang generation with live retrieval calls. Per the config in that file, each rollout allows at most 2 search turns and returns the top-3 docs per turn from a local dense retriever. The reward is computed by generate_with_search.reward_func, which is pure exact match of the extracted <answer> against the gold answers.
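The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not slime's actual generate_with_search code: `rollout`, `generate`, and `retrieve` are hypothetical names, with `generate` standing in for the sglang call and `retrieve` for the local dense retriever. Only the two limits (2 search turns, top-3 docs) come from the post.

```python
import re
from typing import Callable, List

MAX_SEARCH_TURNS = 2  # from the post: at most 2 search turns per rollout
TOP_K = 3             # from the post: top-3 docs per turn

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def rollout(prompt: str,
            generate: Callable[[str], str],
            retrieve: Callable[[str, int], List[str]]) -> str:
    """Interleave generation and retrieval until the model stops searching.

    `generate` and `retrieve` are hypothetical stand-ins, not slime APIs.
    """
    response = ""
    searches_used = 0
    while True:
        chunk = generate(prompt + response)
        response += chunk
        match = SEARCH_RE.search(chunk)
        # Stop when the model did not ask for a search, or the budget is spent.
        if match is None or searches_used >= MAX_SEARCH_TURNS:
            break
        searches_used += 1
        docs = retrieve(match.group(1).strip(), TOP_K)
        passages = " ".join(f"Doc {i}: {d}" for i, d in enumerate(docs, 1))
        # The environment, not the model, injects the <information> block.
        response += f"\n<information> {passages} </information>\n"
    return response
```

The injected <information> text is produced by the environment rather than the policy, so a real trainer would typically exclude those tokens from the loss; that detail is omitted here.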

Training process

Recipe at a glance

Component	Choice
Algorithm	GRPO (`--advantage-estimator grpo`)
Base model	Qwen2.5-3B (also the KL ref)
Reward	Exact match
Training data	`nq_hotpotqa_train/train.parquet` (mixed NQ + HotpotQA)
Rollout steps	3000 (`--num-rollout 3000`)
Prompts / step	32 (`--rollout-batch-size 32`)
Samples / prompt	8 (`--n-samples-per-prompt 8`) → 256 trajectories/step
Global batch size	256
Max response length	512 tokens (`--rollout-max-response-len 512`)
Rollout temperature	1.0 (exploration during training)
Optimizer	Adam, lr 1e-6 constant, wd 0.01, β=(0.9, 0.98)
KL	low-var KL to ref, coef 1e-3
Entropy bonus	0.0
PPO clip	0.2 / 0.28 high (clip-higher)
TIS	on
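To make the GRPO row concrete, here is a minimal sketch of group-relative advantages for one prompt's 8 samples with binary EM rewards. It assumes the common mean/std normalization within a group; whether slime uses exactly this form (population std, this epsilon) is an assumption, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List

PROMPTS_PER_STEP = 32    # --rollout-batch-size
SAMPLES_PER_PROMPT = 8   # --n-samples-per-prompt
TRAJECTORIES_PER_STEP = PROMPTS_PER_STEP * SAMPLES_PER_PROMPT  # 256


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples of the same prompt.

    Assumed form: (r - mean) / (std + eps), with population std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct answer out of eight: the correct sample gets a large positive
# advantage, the seven wrong ones a small negative one.
advantages = group_relative_advantages([1.0] + [0.0] * 7)
```

Under this form, a group in which all 8 samples score the same (all right or all wrong) yields zero advantages and contributes no gradient signal, which is why a sparse EM reward depends on sampling at temperature 1.0 to get mixed groups.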

Training logs

[Figures: reward curve; response length]

The model learns to emit effective responses while the response length stays stable.

Evaluation

Eval re-uses the exact training generation/reward path with greedy decoding (--eval-temperature 0) and 1 sample per question. The merged test.parquet is split by data_source into:

NQ — nq_test.parquet, 3,610 questions
HotpotQA — hotpotqa_test.parquet, 7,405 questions
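Reporting one EM number per split amounts to grouping per-question scores by their data_source. The sketch below shows that aggregation on plain tuples; `em_by_source` is a hypothetical helper, and the label strings "nq" and "hotpotqa" are assumed values of data_source rather than confirmed ones.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def em_by_source(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average per-question EM scores separately for each data_source."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for source, score in rows:
        totals[source] += score
        counts[source] += 1
    return {source: totals[source] / counts[source] for source in totals}


# Toy input: (data_source, EM) pairs; labels are assumed, scores are made up.
toy_scores = em_by_source([("nq", 1.0), ("nq", 0.0), ("hotpotqa", 1.0)])
```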

Headline

Model	NQ EM	HotpotQA EM
Qwen2.5-3B base	0.1620	0.0984
Qwen2.5-3B Search-R1 RL for 3000 steps	0.4576	0.3381
Δ (RL gain)	+0.2956	+0.2397

RL training raises NQ EM by ~2.8× (roughly triple) and HotpotQA EM by ~3.4×.

Metric definition

The score is Exact Match (EM) as implemented in qa_em_format.py → compute_score_em:

Extract the model’s final <answer>…</answer>.
Normalize (lowercase, strip punctuation, drop articles a/an/the, collapse whitespace as in SQuAD/DPR normalization).
Score = 1.0 if it exactly equals any gold answer, else 0.0.
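The normalize-and-compare steps above can be sketched as follows. This follows the widely used SQuAD-style normalization order (lowercase, strip punctuation, drop articles, collapse whitespace) and is an illustration, not a copy of compute_score_em; `normalize_answer` and `exact_match` are names chosen for this sketch.

```python
import re
import string
from typing import Iterable


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, no punctuation, no articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def exact_match(prediction: str, golds: Iterable[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in golds) else 0.0
```

Because the comparison is on the whole normalized string, an answer like "the rapper Ludacris" would score 0.0 against the gold "Ludacris"; the reward gives no partial credit.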

Qualitative comparison

Trained model — uses the search tool, grounds the answer

Question: “who sings the rap in baby by justin bieber?” (gold: Ludacris)

<think><search>who sings the rap in baby by justin bieber</search>
<information>
  Doc 1 (Baby (Justin Bieber song)): …written by Bieber with Christopher "Tricky"
  Stewart and Terius "The-Dream" Nash … and also by R&B singer Christina Milian
  and labelmate, rapper Ludacris…
  Doc 2 … Doc 3 …commended Ludacris' part…
</information>
<answer> Ludacris </answer>

Extracted “Ludacris” = gold. EM = 1. A complete search → read → answer trace in ~40 tokens.

Base model — two typical failure modes

Mode 1 — answers from memory, never searches (often wrong):
Question: “who sings song at the end of beauty and the beast?” (gold: Josh Groban)

<|im_start|>assistant I found the answer to your question! The song at the end of Beauty and the Beast is sung by Celine Dion. <answer> Celine Dion </answer>

No <search>. Confidently wrong. EM = 0.

Mode 2 — doesn’t answer, echoes the instructions (degenerate):
Question: (gold: Kentucky)

<|im_start|>assistant I will conduct reasoning inside <think> and </think> first.
I will search for the information if I find that I lack any knowledge. I will
provide the answer inside <answer> and </answer>

Never searches, never answers; scorer extracts a stray “and”. EM = 0.
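The stray “and” is easy to reproduce with a simple tag-based extractor. The sketch below assumes the scorer takes the text of the last <answer>…</answer> pair; the real extractor may handle multiple or missing matches differently, and `extract_last_answer` is a hypothetical name.

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_last_answer(response: str) -> Optional[str]:
    """Return the stripped text of the last <answer>...</answer> pair, if any."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


# The base model's echo of the instructions, taken from the log above.
echoed = ("I will conduct reasoning inside <think> and </think> first. "
          "I will provide the answer inside <answer> and </answer>")

stray = extract_last_answer(echoed)  # "and"
```

The extractor has no way to tell a real answer from tag names quoted in prose, so the echoed sentence produces the one-word "answer" that then fails exact match against Kentucky.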

What the contrast shows

	Base	Trained
Uses `<search>` tool	Rarely / never	Yes, reflexively
Grounds answer in retrieved docs	No	Yes
Output style	Rambles / echoes prompt	Concise: search → read → answer (~37 tokens)
Typical result	Memorized guess or malformed	Retrieved, correct

The lesson the EM reward actually taught is the agentic behavior: on a factual question, issue a search, read the passages, and extract the answer in the <think>/<search>/<information>/<answer> protocol. The base model has the same underlying knowledge but doesn’t invoke the tool or follow the format, so it guesses from memory (Mode 1) or parrots the prompt (Mode 2).
Also note that the closing </think> tag is absent from the trained model’s output. Because the reward is pure exact match, the policy learned during training that reasoning tokens and the </think> close tag earn nothing. To make the output more format-valid, we should give some weight to structure_format_score.
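One way to give format some weight is sketched below. It is an illustration of the idea only: the structural check, the scoring scheme, and the 0.2 default are assumptions made for this example, not the behavior of compute_score_em, and `is_well_formed` and `shaped_reward` are hypothetical names.

```python
MODEL_TAGS = ("think", "search", "answer")  # <information> is env-injected


def is_well_formed(response: str) -> bool:
    """Loose check: every model tag is closed, with one <think> and one <answer>."""
    for tag in MODEL_TAGS:
        if response.count(f"<{tag}>") != response.count(f"</{tag}>"):
            return False
    return response.count("<think>") >= 1 and response.count("<answer>") == 1


def shaped_reward(em: float, response: str,
                  structure_format_score: float = 0.2) -> float:
    """Blend exact match with a small format term (assumed scheme).

    correct + well-formed -> 1.0
    correct + malformed   -> 1.0 - structure_format_score
    wrong   + well-formed -> structure_format_score
    wrong   + malformed   -> 0.0
    """
    well_formed = is_well_formed(response)
    if em >= 1.0:
        return 1.0 if well_formed else 1.0 - structure_format_score
    return structure_format_score if well_formed else 0.0
```

With this scheme the trained model's trace above, which opens <think> but never closes it, would earn 0.8 instead of 1.0, giving the policy a reason to emit the close tag while keeping correctness as the dominant term.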

Reproduce

Follow the GitHub repo to reproduce the experiment. I set up the local retriever on the host in a conda environment and ran the slime Search-R1 training script in the slimerl:slime Docker image.

http://gong208.github.io/2026/06/29/2026-06-29-SlimeSearchR1Example/

Author

Jiangshan Gong
