
Search-R1 on slime: Training a Qwen2.5-3B Search Agent

This post walks through a full Search-R1 run on slime: RL-training a Qwen2.5-3B base model to use a retrieval tool on open-domain QA, then evaluating it against the pre-RL base on the NQ and HotpotQA test sets.
The short version: 3000 GRPO steps lift NQ exact match by ~2.8× and HotpotQA by ~3.4×, and the qualitative logs show that RL teaches the model to search and ground its answers instead of guessing from memory.

The task

Search-R1 frames open-domain QA as a multi-turn agentic loop. The model reasons, optionally emits a search query, reads the retrieved passages, and commits to a final answer, all within one structured protocol:

<think> … </think>
<search> query </search> # optional, can repeat
<information> Doc 1 … Doc 2 … Doc 3 </information> # injected by the env
<answer> final answer </answer>

slime runs this through a custom generate function (generate_with_search.generate) that interleaves sglang generation with live retrieval calls. Per the config in that file, each rollout allows at most 2 search turns and returns the top-3 docs per turn from a local dense retriever. The reward is computed by generate_with_search.reward_func, which is pure exact match of the extracted <answer> against the gold answers.
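The loop described above can be sketched as follows. This is a minimal illustration, not the code in generate_with_search: `generate_fn` and `retrieve_fn` are hypothetical callables standing in for the sglang call and the dense retriever, and the two constants simply mirror the limits quoted above.

```python
import re
from typing import Callable, List

MAX_SEARCH_TURNS = 2  # per the post: at most 2 search turns per rollout
TOP_K = 3             # per the post: top-3 docs per turn

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def rollout(prompt: str,
            generate_fn: Callable[[str], str],
            retrieve_fn: Callable[[str, int], List[str]]) -> str:
    """Interleave generation and retrieval until an answer or the turn limit.

    generate_fn(context) returns the next model segment (hypothetical; assumed
    to stop after </search> or </answer>). retrieve_fn(query, k) returns a
    list of k passages (hypothetical).
    """
    context = prompt
    response = ""
    for _ in range(MAX_SEARCH_TURNS):
        segment = generate_fn(context)
        response += segment
        context += segment
        match = SEARCH_RE.search(segment)
        if match is None:
            # No search request: the model answered (or stopped) on its own.
            return response
        docs = retrieve_fn(match.group(1).strip(), TOP_K)
        info = "\n<information>\n" + "\n".join(
            f"Doc {i + 1}: {d}" for i, d in enumerate(docs)
        ) + "\n</information>\n"
        # Retrieved text is injected by the environment, not generated.
        response += info
        context += info
    # Search budget exhausted: one last generation to produce the answer.
    return response + generate_fn(context)
```

In a real training setup the injected <information> tokens would typically be masked out of the loss, since the policy did not produce them; that detail is omitted here.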

Training process

Recipe at a glance

Component            Choice
Algorithm            GRPO (--advantage-estimator grpo)
Base model           Qwen2.5-3B (also the KL ref)
Reward               Exact match
Training data        nq_hotpotqa_train/train.parquet (mixed NQ + HotpotQA)
Rollout steps        3000 (--num-rollout 3000)
Prompts / step       32 (--rollout-batch-size 32)
Samples / prompt     8 (--n-samples-per-prompt 8) → 256 trajectories/step
Global batch size    256
Max response length  512 tokens (--rollout-max-response-len 512)
Rollout temperature  1.0 (exploration during training)
Optimizer            Adam, lr 1e-6 constant, wd 0.01, β=(0.9, 0.98)
KL                   low-var KL to ref, coef 1e-3
Entropy bonus        0.0
PPO clip             0.2 low / 0.28 high (clip-higher)
TIS                  on
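To make the batch arithmetic and the GRPO advantage concrete: 32 prompts × 8 samples gives 256 trajectories per step, and each trajectory is scored relative to the other samples for the same prompt. The sketch below uses the common GRPO formulation (reward minus group mean, divided by group standard deviation). Whether slime uses exactly this normalization and epsilon is an assumption, and `group_advantages` is an illustrative name.

```python
from statistics import mean, pstdev
from typing import List

PROMPTS_PER_STEP = 32
SAMPLES_PER_PROMPT = 8
assert PROMPTS_PER_STEP * SAMPLES_PER_PROMPT == 256  # trajectories per step


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary EM rewards for one prompt where 2 of 8 samples are correct.
adv = group_advantages([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
# Correct samples get a positive advantage and incorrect ones a negative one.
# A group that is all-correct or all-wrong yields all-zero advantages,
# so such prompts contribute no gradient signal under this formulation.
```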

Training logs

[Figure: training curves, panels "Reward curve" and "Response length"]

The model learns to produce effective responses while keeping response length stable.

Evaluation

Eval re-uses the exact training generation/reward path with greedy decoding (--eval-temperature 0) and 1 sample per question. The merged test.parquet is split by data_source into:

  • NQ — nq_test.parquet, 3,610 questions
  • HotpotQA — hotpotqa_test.parquet, 7,405 questions

Headline

Model                                   NQ EM    HotpotQA EM
Qwen2.5-3B base                         0.1620   0.0984
Qwen2.5-3B Search-R1 RL for 3000 steps  0.4576   0.3381
Δ (RL gain)                             +0.2956  +0.2397

RL training lifts NQ EM by ~2.8× and HotpotQA EM by ~3.4×.
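The absolute and relative gains can be recomputed from the headline numbers. The snippet only restates the table values; the `scores` dict is just a convenient container.

```python
scores = {
    "NQ": {"base": 0.1620, "rl": 0.4576},
    "HotpotQA": {"base": 0.0984, "rl": 0.3381},
}

for name, s in scores.items():
    gain = s["rl"] - s["base"]    # absolute EM gain
    ratio = s["rl"] / s["base"]   # relative improvement
    print(f"{name}: +{gain:.4f} absolute, {ratio:.2f}x relative")

# NQ: +0.2956 absolute, 2.82x relative
# HotpotQA: +0.2397 absolute, 3.44x relative
```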

Metric definition

The score is Exact Match (EM) as implemented by compute_score_em in qa_em_format.py:

  • Extract the model’s final <answer>…</answer>.
  • Normalize (lowercase, strip punctuation, drop articles a/an/the, collapse whitespace, as in SQuAD/DPR normalization).
  • Score = 1.0 if it exactly equals any gold answer, else 0.0.
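The three steps above can be sketched in a few lines. This is a minimal re-implementation in the SQuAD style, not the code in qa_em_format.py: the function names are illustrative, and taking the last <answer> span when several are present is an assumption.

```python
import re
import string
from typing import Iterable, Optional

_PUNCT = set(string.punctuation)


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in _PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> span, or None."""
    spans = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return spans[-1].strip() if spans else None


def exact_match(response: str, golds: Iterable[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = extract_answer(response)
    if pred is None:
        return 0.0
    gold_set = {normalize_answer(g) for g in golds}
    return 1.0 if normalize_answer(pred) in gold_set else 0.0
```

For example, `exact_match("<answer> Ludacris </answer>", ["Ludacris"])` scores 1.0, while a response with no <answer> block scores 0.0.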

Qualitative comparison

Trained model — uses the search tool, grounds the answer

Question: “who sings the rap in baby by justin bieber?” (gold: Ludacris)

<think><search>who sings the rap in baby by justin bieber</search>
<information>
Doc 1 (Baby (Justin Bieber song)): …written by Bieber with Christopher "Tricky"
Stewart and Terius "The-Dream" Nash … and also by R&B singer Christina Milian
and labelmate, rapper Ludacris…
Doc 2 … Doc 3 …commended Ludacris' part…
</information>
<answer> Ludacris </answer>

Extracted “Ludacris” = gold, so EM = 1. A complete search → read → answer trace in ~40 tokens.

Base model — two typical failure modes

Mode 1 — answers from memory, never searches (often wrong):
Question: “who sings song at the end of beauty and the beast?” (gold: Josh Groban)

<|im_start|>assistant I found the answer to your question! The song at the end of
Beauty and the Beast is sung by Celine Dion. <answer> Celine Dion </answer>

No <search>. Confidently wrong. EM = 0.

Mode 2 — doesn’t answer, echoes the instructions (degenerate):
Question: (gold: Kentucky)

<|im_start|>assistant I will conduct reasoning inside <think> and </think> first.
I will search for the information if I find that I lack any knowledge. I will
provide the answer inside <answer> and </answer>

Never searches, never answers; scorer extracts a stray “and”. EM = 0.
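The stray “and” is easy to reproduce: the echoed instructions contain the literal text `<answer> and </answer>`, which a tag-based extractor treats as a real answer span. The sketch below assumes a non-greedy regex extractor that takes the last match; the actual extraction code may differ in detail.

```python
import re

# The base model's echoed instructions, as quoted above.
echoed = (
    "I will conduct reasoning inside <think> and </think> first. "
    "I will search for the information if I find that I lack any knowledge. "
    "I will provide the answer inside <answer> and </answer>"
)

# Assumed extractor: last non-greedy <answer>...</answer> span, stripped.
spans = re.findall(r"<answer>(.*?)</answer>", echoed, re.DOTALL)
prediction = spans[-1].strip()
print(repr(prediction))  # 'and'
```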

What the contrast shows

Behavior                          Base                          Trained
Uses <search> tool                Rarely / never                Yes, reflexively
Grounds answer in retrieved docs  No                            Yes
Output style                      Rambles / echoes prompt       Concise: search → read → answer (~37 tokens)
Typical result                    Memorized guess or malformed  Retrieved, correct

The lesson the EM reward actually taught is the agentic behavior: on a factual question, issue a search, read the passages, and extract the answer in the <think>/<search>/<information>/<answer> protocol. The base model has the same underlying knowledge but doesn’t invoke the tool or follow the format, so it guesses from memory (Mode 1) or parrots the prompt (Mode 2).
Also note that the closing </think> tag is absent from the trained model’s output. Because the reward is pure exact match, the policy learned during training that reasoning tokens and the </think> close tag earn nothing. To make outputs more format-valid, we should give some weight to structure_format_score.
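One way to picture a format-aware reward is sketched below. It is an illustrative weighting, not the logic in the repository: `is_well_formed`, `shaped_reward`, and the 0.2 default are hypothetical, and only the parameter name structure_format_score comes from the text above.

```python
def is_well_formed(response: str) -> bool:
    """Loose format check (illustrative): all tags balanced, at least one
    closed <think> block, and exactly one <answer> block."""
    for tag in ("think", "search", "answer"):
        if response.count(f"<{tag}>") != response.count(f"</{tag}>"):
            return False
    return response.count("<think>") >= 1 and response.count("<answer>") == 1


def shaped_reward(em: float, response: str,
                  structure_format_score: float = 0.2) -> float:
    """Combine exact match with a format term.

    correct   + well-formed -> 1.0
    correct   + malformed   -> 1.0 - structure_format_score
    incorrect + well-formed -> structure_format_score
    incorrect + malformed   -> 0.0
    """
    well_formed = is_well_formed(response)
    if em == 1.0:
        return 1.0 if well_formed else 1.0 - structure_format_score
    return structure_format_score if well_formed else 0.0
```

Under this scheme, a correct trace with an unclosed <think>, like the trained model's example above, would earn less than the full reward, giving the policy a reason to close the tag.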

Reproduce

Follow the GitHub repo to reproduce the experiment. I set up the local retriever on the host in a conda environment and ran the slime search-r1 training script in the slimerl:slime Docker image.

Author

Jiangshan Gong

Posted on

2026-06-29

Updated on

2026-07-01
