Paper: Training-Free Group Relative Policy Optimization
By: Youtu-Agent Team
Publication date: October 9, 2025
The Problem
Fine-tuning LLMs with reinforcement learning (RL) to improve agent performance in specialized domains is expensive, data-hungry, and prone to overfitting, and it tends to degrade cross-domain generalization. Because of compute constraints, most RL approaches are limited to sub-32B models.
The Core Idea
Instead of updating model parameters, as gradient-based RL does, Training-Free GRPO updates the model’s context: it builds an evolving library of experiential knowledge that is injected into the prompt. The model weights stay frozen throughout.
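
To make "updating the context" concrete, here is a minimal sketch of how an experience library might be injected into a prompt. The function name `build_prompt` and the prompt wording are assumptions for illustration; the source does not specify the actual prompt format.

```python
from typing import List


def build_prompt(query: str, experiences: List[str]) -> str:
    """Prepend the current experience library to a query.

    Hypothetical format: the source does not give the real prompt template.
    """
    if not experiences:
        # An empty library leaves the query unchanged.
        return query
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(experiences, start=1)
    )
    return (
        "Useful experiences from earlier attempts:\n"
        f"{numbered}\n\n"
        f"Problem: {query}"
    )


# The model weights never change; only this list of strings does.
library = ["Check boundary cases before finalizing an answer."]
prompt = build_prompt("Find all integers n with n^2 < 50.", library)
print(prompt)
```

Under this framing, "learning" amounts to editing `library` between calls, which is why the approach can work with a model that is only reachable through an API.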
How It Works
The method mirrors vanilla GRPO’s structure but replaces gradient updates with context updates:
- Rollout: For each training query, sample a group of G outputs from the frozen LLM, conditioned on the current experience library E
- Reward: Score each output with a reward model (same as standard GRPO)
- Semantic Advantage: Instead of computing a numerical advantage for gradient ascent, the LLM summarizes each trajectory, then compares the group’s winners and losers to extract natural-language “lessons learned”. These lessons are the semantic advantage
- Optimization: The experience library E is updated through Add / Delete / Modify / Keep operations based on these lessons. In the next epoch, the enriched E steers the model toward better outputs
This repeats for 3 epochs over ~100 training samples.
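
The four steps can be sketched as a small loop. This is an illustrative outline under stated assumptions, not the authors’ implementation: `generate`, `reward`, and `extract_edits` are hypothetical callables standing in for LLM and reward-model calls, the `(operation, entry id, text)` edit format is invented here, and skipping groups whose rewards are all equal is an assumption (such a group offers no winner-versus-loser contrast). The toy stubs at the bottom exist only so the sketch runs.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Library = Dict[int, str]
# Hypothetical edit format: (operation, entry id, text).
Edit = Tuple[str, int, str]


def apply_edits(library: Library, edits: List[Edit]) -> Library:
    """Apply Add / Delete / Modify / Keep operations to a copy of the library."""
    updated = dict(library)
    for op, entry_id, text in edits:
        if op == "Add":
            updated[max(updated, default=0) + 1] = text
        elif op == "Delete":
            updated.pop(entry_id, None)
        elif op == "Modify" and entry_id in updated:
            updated[entry_id] = text
        # "Keep" is a no-op.
    return updated


def training_free_grpo(
    queries: List[str],
    generate: Callable[[str, Library], str],
    reward: Callable[[str, str], float],
    extract_edits: Callable[[str, List[str], List[float], Library], List[Edit]],
    group_size: int = 4,
    epochs: int = 3,
) -> Library:
    library: Library = {}
    for _ in range(epochs):
        for query in queries:
            # Rollout: G outputs from the frozen model, conditioned on the library.
            group = [generate(query, library) for _ in range(group_size)]
            # Reward: score every output in the group.
            scores = [reward(query, out) for out in group]
            # Assumption: identical rewards give nothing to contrast, so skip.
            if max(scores) == min(scores):
                continue
            # Semantic advantage: lessons expressed as library edits.
            edits = extract_edits(query, group, scores, library)
            # Optimization: update the context, never the weights.
            library = apply_edits(library, edits)
    return library


# Toy stand-ins so the sketch runs without a model.
_answers = itertools.cycle(["4", "5"])


def toy_generate(query: str, library: Library) -> str:
    return next(_answers)


def toy_reward(query: str, output: str) -> float:
    return 1.0 if output == "4" else 0.0


def toy_extract_edits(query, group, scores, library) -> List[Edit]:
    lesson = "Re-check the arithmetic before answering."
    if lesson in library.values():
        return [("Keep", 0, "")]
    return [("Add", 0, lesson)]


final_library = training_free_grpo(
    ["What is 2 + 2?"], toy_generate, toy_reward, toy_extract_edits
)
print(final_library)
```

In a real setup, `extract_edits` would be where the LLM summarizes each trajectory and compares winners with losers; here it is reduced to a fixed lesson to keep the control flow visible.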
Results
Training-Free GRPO was applied to DeepSeek-V3.1-Terminus (671B) on the AIME math benchmarks and on WebWalkerQA web search. The AIME results:
| Method | AIME24 | AIME25 | Cost |
|---|---|---|---|
| ReAct baseline | 80.0% | 67.9% | — |
| + Training-Free GRPO | 82.7% | 73.3% | ~$18 |
| ReTool (RL-trained 32B) | 67.0% | 49.3% | ~$10,000 |
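
As a quick check on what the table implies, the snippet below computes the gains over the ReAct baseline and the approximate cost ratio. It uses only the figures shown in the table. The costs are approximate, and ReTool is a different (32B) model, so the ratio is a rough indication rather than a like-for-like comparison.

```python
# Scores copied from the table (percent).
results = {
    "ReAct baseline": {"AIME24": 80.0, "AIME25": 67.9},
    "+ Training-Free GRPO": {"AIME24": 82.7, "AIME25": 73.3},
}

base = results["ReAct baseline"]
tuned = results["+ Training-Free GRPO"]

# Absolute gain in percentage points on each benchmark.
gains = {bench: round(tuned[bench] - base[bench], 1) for bench in base}

# Both costs are approximate figures from the table.
cost_ratio = 10_000 / 18

print(gains)
print(f"cost ratio: about {cost_ratio:.0f}x")
```

The gains come to 2.7 points on AIME24 and 5.4 points on AIME25, at a cost that is several hundred times lower than the RL-trained comparison.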
Key Advantages
- Cost: ~$18 vs. ~$10,000 for the RL-trained ReTool baseline
- Data: ~100 training samples vs. 17,000+ for RL fine-tuning
- Generalization: Because the weights never change, one frozen model can swap in a different experience library per domain and perform well on both math and web search, something parameter-tuned specialists can’t do
- No infrastructure: Works with any frozen API-based model, no dedicated GPU cluster needed
Limitations
Effectiveness depends on the underlying model’s baseline capability. Results on weaker models (e.g., QwQ-32B on web tasks) were mixed or negative, which suggests that a capable base model is a prerequisite.
TL;DR
Training-Free GRPO shows that, for sufficiently capable LLMs, you can get RL-like performance gains by teaching the prompt rather than the parameters, at a fraction of the cost.