Self-Evolving Search Agents: How LLMs Learn Without Training Data
Exploring Dr. Zero's framework where LLM agents bootstrap their own training data through self-play, enabling continuous improvement without human annotation.
Table of Contents
- Introduction: The Data Bottleneck Problem
- The Diminishing Returns of Data
- Self-Evolution: A New Paradigm
- The Core Challenge: Why Search Agents Are Different
- Multi-Turn Reasoning Complexity
- Tool Use and External Knowledge
- Why Existing Frameworks Fall Short
- Dr. Zero: Framework Overview
- The Key Insight
- Architecture Components
- The Self-Evolution Feedback Loop
- The Proposer: Generating Training Data from Nothing
- Multi-Turn Tool-Use Rollout
- Difficulty-Guided Rewards
- Automated Curriculum Learning
- The Solver: Learning to Search and Reason
- Multi-Step Reasoning Pipeline
- Learning from Synthetic Questions
- The Co-Evolution Dynamic
- Hop-Grouped Relative Policy Optimization (HRPO)
- The Problem with Standard GRPO
- HRPO: A Smarter Grouping Strategy
- Mathematical Formulation
- Technical Deep Dive: The Mathematics
- Policy Gradient Foundation
- Advantage Estimation
- Reward Shaping for Multi-Hop Questions
- Experimental Setup
- Models and Training
- Benchmarks
- Baselines
- Results and Analysis
- Main Results
- Ablation Studies
- Scaling Properties
- Implications: The Future of Self-Evolving AI
- Beyond Search Agents
- Limitations and Open Questions
- The Bigger Picture
- References
We're running out of data. Not in some abstract, distant-future sense—right now, today, the AI industry is hitting a wall. The internet has been scraped, books have been digitized, and the low-hanging fruit of human-generated text has largely been picked. Yet models keep getting bigger, and their appetite for training data grows exponentially. Something has to give.
What if models could train themselves? What if, instead of relying on expensive human annotation and increasingly scarce high-quality data, language models could autonomously generate their own training problems, solve them, and learn from the experience? This isn't science fiction—it's the frontier of AI research right now. And a recent paper from Meta's Superintelligence Labs, Dr. Zero (DeepResearch-Zero), demonstrates that this vision is not only possible but can match or even surpass fully supervised approaches.
Introduction: The Data Bottleneck Problem
The Diminishing Returns of Data
The scaling laws that have driven AI progress for the past decade are brutally simple: more data plus more compute equals better models. This equation worked beautifully when the internet was an untapped goldmine. But we've mined most of it now. Estimates of how much genuinely high-quality text exists on the public internet vary, but current frontier models are already trained on trillions of tokens, which means the best data is already being recycled and reused multiple times.
The problem compounds when you need specialized capabilities. Training a model to do multi-step research—the kind where it needs to search for information, reason about what it finds, search again based on that reasoning, and synthesize a final answer—requires datasets of exactly that behavior. Such datasets are extraordinarily expensive to create. You need human experts who can demonstrate the entire reasoning chain, annotate what they're thinking at each step, and do this thousands of times across diverse topics.
This is where self-evolution enters the picture.
Self-Evolution: A New Paradigm
The core idea of self-evolving AI is elegantly simple: use the model to generate training data for itself. A proposer component generates problems, and a solver component attempts to solve them. The solver's attempts are evaluated (sometimes by the model itself, sometimes by external verification), and successful reasoning traces become training data. As the solver improves, it pushes the proposer to generate harder problems, creating an automated curriculum that progressively builds capability.
This paradigm has shown remarkable success in constrained domains like mathematics. Mathematical problems have clear answers—you can verify whether 2+2 equals 4 without human annotation. Models like those in the "STaR" (Self-Taught Reasoner) family have demonstrated that starting from a small seed of examples, models can bootstrap their way to impressive mathematical reasoning through self-generated training data.
But mathematics is the easy case. The problems are well-defined, the solution space is constrained, and verification is straightforward. What happens when we try to apply self-evolution to something messier—like open-domain question answering that requires searching the web, synthesizing information from multiple sources, and reasoning across facts that the model hasn't memorized?
This is precisely the challenge that Dr. Zero tackles.
The Core Challenge: Why Search Agents Are Different
Multi-Turn Reasoning Complexity
Consider a seemingly simple question: "What year did the director of Inception also release a film that won the Academy Award for Best Picture?"
Answering this requires multiple steps:
- Identify the director of Inception (Christopher Nolan)
- Search for Christopher Nolan's filmography
- Identify which of his films won Best Picture (Oppenheimer, which won at the 2024 ceremony)
- Return that film's release year (2023)
Each step depends on the previous one. The search query at step 2 can't be formulated until step 1 completes. The reasoning at step 3 requires the results from step 2. This sequential dependency creates a combinatorial explosion in the space of possible reasoning traces.
For self-evolution to work, the proposer needs to generate questions that exhibit this multi-hop structure. But here's the catch: if you just ask a language model to "generate hard questions," it tends to produce questions that sound complex but are actually answerable in a single search. The model doesn't naturally understand what makes a question genuinely require multi-step reasoning.
Tool Use and External Knowledge
Search agents don't just reason—they interact with external tools. Each tool call is an action that produces an observation, and the model must decide what to do next based on that observation. This creates a vastly larger action space compared to pure reasoning tasks.
In mathematical self-evolution, the model generates a problem, attempts a solution, and can verify correctness purely through computation. In search tasks, verification requires access to ground truth that may not exist in any structured form. How do you know if "Christopher Nolan directed Inception" is correct without looking it up? And if you look it up, you need a search engine—the same tool the agent is trying to learn to use.
This creates a bootstrapping problem: the agent needs to use search to generate training data for learning to search.
Why Existing Frameworks Fall Short
Prior work on data-free self-evolution, while groundbreaking for mathematical reasoning, struggles with search agents for three main reasons:
Limited Question Diversity: Trained proposers exhibit a strong bias toward simple, one-hop questions. If you ask a model to generate questions and reward it for questions the solver can answer, it quickly learns to generate easy questions. The path of least resistance leads to a degenerate curriculum where the proposer and solver collude in mediocrity.
No Progressive Difficulty: Without explicit mechanisms to increase difficulty as the solver improves, the system reaches equilibrium too early. The proposer generates questions at the edge of the solver's current capability, but that edge never advances because there's no pressure to push beyond it.
Compute Overhead: Standard training algorithms like Group Relative Policy Optimization (GRPO) require sampling multiple responses for each question to estimate baselines. For search agents, each response involves multiple tool calls, each of which requires actual search engine queries. The compute cost of nested sampling—multiple questions, each with multiple response trajectories, each with multiple search calls—becomes prohibitive.
Dr. Zero addresses all three of these challenges.
Dr. Zero: Framework Overview
The Key Insight
The fundamental insight of Dr. Zero is that external search engines can serve as both a tool for the agent and a source of supervision signals for training. This dual role eliminates the need for human-curated training data.
When the proposer generates a question, it does so by first searching for factual information, then constructing a question that requires reasoning across multiple facts. The search engine provides the ground truth answer. When the solver attempts to answer, the search engine provides the information needed to reason toward (hopefully) the same answer. The match between the solver's answer and the proposer's intended answer provides the reward signal.
This creates a closed loop where the search engine is the oracle, the proposer is the teacher, and the solver is the student—all without any human in the loop.
Architecture Components
Dr. Zero consists of two main components initialized from the same base language model:
The Proposer: Generates diverse, multi-hop questions along with their ground truth answers. The proposer is trained to produce questions that are (1) answerable using web search, (2) require multiple reasoning steps, and (3) are at the appropriate difficulty level for the current solver.
The Solver: Learns to answer questions through multi-turn interaction with a search engine. The solver generates reasoning traces, issues search queries, processes results, and produces final answers. It's trained on the synthetic questions generated by the proposer.
Both components evolve together. As the solver gets better, the proposer's reward function encourages harder questions. As the proposer generates harder questions, the solver is pushed to develop more sophisticated reasoning strategies.
Dr. Zero self-evolution overview: Proposer and Solver co-evolve across iterations
The Self-Evolution Feedback Loop
The training proceeds in iterations:
- Proposer Rollout: The proposer uses the search engine to gather factual information and generates a batch of multi-hop questions with answers.
- Solver Training: The solver attempts to answer the proposer's questions using the search engine. Successful reasoning traces (those that reach the correct answer) are used as positive training examples.
- Proposer Training: The proposer is updated based on a difficulty-guided reward. Questions that are too easy (solver always succeeds) or too hard (solver always fails) receive lower rewards than questions at the boundary of the solver's capability.
- Repeat: The improved solver changes the difficulty landscape, incentivizing the proposer to adapt, which generates new training data for the next round.
Self-Evolution Feedback Loop: Proposer generates QA pairs, Solver predicts answers, rewards update both models
This feedback loop creates an automated curriculum that progressively builds capability without any human intervention.
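To make the loop concrete, here is a minimal Python sketch of one iteration. The helpers (`generate_questions`, `attempt_answer`, `update_policy`) and the exact reward shape are illustrative assumptions, not the paper's actual interfaces.

```python
# Sketch of one Dr. Zero-style self-evolution iteration.
# All helpers passed in as arguments are hypothetical placeholders.

def self_evolution_iteration(proposer, solver, search_engine,
                             generate_questions, attempt_answer, update_policy,
                             n_questions=1000, n_attempts=4):
    # 1. Proposer rollout: questions grounded in real search results.
    qa_pairs = generate_questions(proposer, search_engine, n_questions)

    # 2. Solver rollout: several attempts per synthetic question.
    trajectories, success_rate = [], {}
    for qa in qa_pairs:
        attempts = [attempt_answer(solver, qa["question"], search_engine)
                    for _ in range(n_attempts)]
        rewards = [1.0 if a["answer"] == qa["answer"] else 0.0 for a in attempts]
        success_rate[qa["id"]] = sum(rewards) / n_attempts
        trajectories.extend(zip(attempts, rewards))

    # 3. Solver update: RL on its own trajectories (reward 1 or 0).
    update_policy(solver, trajectories)

    # 4. Proposer update: difficulty-guided reward peaks near 50% success
    #    (illustrative functional form).
    proposer_rewards = {qid: 1.0 - abs(1.0 - 2.0 * p)
                        for qid, p in success_rate.items()}
    update_policy(proposer, list(proposer_rewards.items()))

    return success_rate
```

Running this function repeatedly, with the updated models fed back in, is the whole training procedure: no dataset ever enters the loop.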
The Proposer: Generating Training Data from Nothing
Multi-Turn Tool-Use Rollout
The proposer doesn't just generate questions from its parametric knowledge—it actively searches the web to construct factually grounded, multi-hop questions. This is crucial for ensuring that the generated questions are actually answerable and have verifiable ground truth answers.
The process works as follows:
- Seed Selection: The proposer receives a seed topic or document title (e.g., "Christopher Nolan" or "Australian Labor Party").
- Information Gathering: The proposer issues search queries to gather facts about the seed topic. It might learn that Christopher Nolan directed several films, won various awards, was born in London, etc.
- Hop Construction: The proposer chains facts together to create multi-hop questions. Starting from "Christopher Nolan," it might follow: Nolan → directed Dunkirk → Dunkirk is about WWII → WWII ended in 1945. A resulting question might be: "In what year did the war that is the subject of a Christopher Nolan film end?"
- Answer Extraction: The final fact in the chain provides the ground truth answer (1945 in this case).
This multi-turn rollout is essential for generating genuinely complex questions. Single-turn generation tends to produce questions that sound multi-hop but can actually be answered directly.
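A rough sketch of how such a rollout might look in code, assuming hypothetical `search(query)` and `llm(prompt)` callables (placeholders for illustration, not the paper's implementation):

```python
# Sketch of the proposer's multi-turn rollout: chain searched facts into
# a multi-hop question whose answer is known by construction.

def propose_multi_hop_question(seed_entity, search, llm, max_hops=3):
    """Chain facts found via search into a multi-hop question and answer."""
    facts, entity = [], seed_entity
    for _ in range(max_hops):
        # Gather facts about the current entity from the search engine.
        snippets = search(f"facts about {entity}")
        # Ask the model for one fact that links to a *new* entity,
        # e.g. "Christopher Nolan -> directed Dunkirk".
        fact = llm(f"From these snippets, state one fact about {entity} "
                   f"that mentions a different entity:\n{snippets}")
        facts.append(fact)
        entity = llm(f"Name the new entity mentioned in: {fact}")
    # The final entity in the chain supplies the ground-truth answer.
    question = llm("Write one question that requires all of these facts, "
                   "without naming the intermediate entities:\n" + "\n".join(facts))
    return {"question": question, "answer": entity,
            "hops": len(facts), "evidence": facts}
```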
Difficulty-Guided Rewards
The proposer's training reward has a crucial property: it encourages questions that are solvable but challenging. A reward with this property can be written as:

$$r_{\text{difficulty}}(q) = 1 - \left|\,1 - 2\,\bar{p}(q)\,\right|$$

where $\bar{p}(q)$ is the solver's empirical success rate on question $q$. This reward is maximized when the solver's success rate is 50%: questions that are neither trivially easy nor impossibly hard. Questions with a 100% success rate (too easy) or a 0% success rate (too hard) both receive zero reward.
In practice, the framework incorporates additional factors:

$$r_{\text{prop}}(q) = r_{\text{difficulty}}(q) + \alpha \cdot \text{hop\_complexity}(q) + \beta \cdot \text{solvability}(q) + \gamma \cdot \text{diversity}(q)$$

where:
- hop_complexity measures how many reasoning steps the question requires
- solvability penalizes questions that are consistently unsolvable
- diversity encourages variety in topics and question structures

The weights $\alpha$, $\beta$, and $\gamma$ are tuned to balance these competing objectives.
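As a toy illustration, a difficulty-guided reward with these properties might look like the following; the functional form and the weight values are assumptions for illustration, not numbers from the paper.

```python
# Illustrative difficulty-guided proposer reward. The base term peaks at a
# 50% solver success rate; the auxiliary terms and weights are assumptions.

def proposer_reward(success_rate, hop_count, solvable, diversity,
                    alpha=0.3, beta=0.5, gamma=0.2):
    difficulty = 1.0 - abs(1.0 - 2.0 * success_rate)   # 0 at 0% or 100%, 1 at 50%
    hop_complexity = min(hop_count / 4.0, 1.0)         # more hops, more reward
    solvability = 0.0 if solvable else -1.0            # penalize unsolvable questions
    return difficulty + alpha * hop_complexity + beta * solvability + gamma * diversity

# An easy question the solver always answers vs. one at the edge of its ability:
print(proposer_reward(success_rate=1.0, hop_count=1, solvable=True, diversity=0.5))
print(proposer_reward(success_rate=0.5, hop_count=3, solvable=True, diversity=0.5))
```

The second question earns a much higher reward, which is exactly the pressure that keeps the proposer generating questions at the solver's frontier.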
Automated Curriculum Learning
As training progresses, something remarkable happens: the optimal difficulty level automatically increases. In early iterations, the solver is weak, so the proposer is rewarded for generating relatively simple 2-hop questions. As the solver improves, those 2-hop questions become too easy (success rate approaches 100%), and the proposer's reward drops. The only way to maintain high reward is to generate harder, 3-hop or 4-hop questions.
This creates an automated curriculum without any explicit difficulty scheduling. The curriculum emerges from the interaction between the proposer's reward function and the solver's evolving capability. The researchers observe that over training iterations, the average number of hops in generated questions increases naturally—the system discovers its own curriculum.
The Solver: Learning to Search and Reason
Multi-Step Reasoning Pipeline
The solver follows a structured approach to answering questions:
- Think: Generate reasoning about what information is needed
- Search: Issue a query to the search engine
- Process: Analyze the search results and extract relevant facts
- Iterate: Decide whether to search again or produce a final answer
- Answer: Generate the final response
This loop continues until the solver is confident enough to answer or reaches a maximum number of search iterations. The entire trajectory—including all reasoning steps, search queries, and the final answer—forms the training data when the answer is correct.
The solver learns not just what to search for, but how to reason about search results. It must handle noisy or irrelevant results, synthesize information from multiple sources, and maintain coherent reasoning across multiple turns.
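A minimal sketch of this loop, again assuming hypothetical `llm(prompt)` and `search(query)` callables rather than the paper's actual agent scaffold:

```python
# Sketch of the solver's think / search / answer loop.

def solve(question, llm, search, max_searches=4):
    """Iteratively reason, query the search engine, and produce an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_searches):
        # Think: decide what information is still missing.
        thought = llm(transcript + "\nWhat do you still need to know? "
                      "Reply 'READY' if you can answer.")
        transcript += f"Thought: {thought}\n"
        if "READY" in thought:
            break
        # Search: turn the thought into a query and record the observation.
        query = llm(transcript + "\nWrite a short search query for that.")
        results = search(query)
        transcript += f"Search: {query}\nResults: {results}\n"
    # Answer: the final response, conditioned on the whole trajectory.
    answer = llm(transcript + "\nFinal answer (short):")
    return answer, transcript   # the transcript becomes training data if correct
```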
Learning from Synthetic Questions
The solver is trained using reinforcement learning on the proposer's generated questions. The reward is simple: +1 for correct answers, 0 for incorrect answers. But the learning signal comes from comparing multiple reasoning trajectories.
For each question, the solver generates several candidate reasoning traces. Traces that reach the correct answer are treated as positive examples (high advantage), while traces that fail are negative examples (low advantage). The policy gradient then pushes the solver toward reasoning strategies that succeed and away from strategies that fail.
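For intuition, here is a minimal per-question advantage computation over several sampled traces; the mean/standard-deviation normalization follows the common GRPO convention and is an assumption here. The hop-grouped variant discussed later replaces this per-question grouping.

```python
# Per-question advantage estimation over several traces for the SAME question.

def per_question_advantages(rewards, eps=1e-6):
    """rewards: 1.0 for traces that reached the correct answer, else 0.0."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four attempts at one question, two of which reached the correct answer:
print(per_question_advantages([1.0, 0.0, 1.0, 0.0]))   # successes get +, failures -
```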
Crucially, the solver never sees human-generated reasoning traces. All its training data comes from its own attempts on synthetic questions. The "teacher" is the reward signal, not human demonstration.
The Co-Evolution Dynamic
The solver and proposer are locked in a productive adversarial relationship. The proposer tries to generate questions that challenge the solver, and the solver tries to answer any question the proposer throws at it.
This dynamic prevents several failure modes:
Preventing Collapse to Easy Questions: If the proposer generated only easy questions, the solver would quickly achieve 100% success, and the proposer's reward would drop. The proposer is thus incentivized to stay ahead of the solver.
Preventing Collapse to Impossible Questions: If the proposer generated impossibly hard questions, the solver would achieve 0% success, and again the proposer's reward would drop. The proposer must ensure questions remain solvable.
Encouraging Genuine Difficulty: The sweet spot—questions that the solver can solve with effort—provides the highest reward. This drives the proposer to find genuinely challenging but fair questions.
The result is a self-regulating system where both agents push each other to improve.
Hop-Grouped Relative Policy Optimization (HRPO)
The Problem with Standard GRPO
Group Relative Policy Optimization (GRPO) has become a popular algorithm for training language models with reinforcement learning. Instead of learning a separate value function (like in PPO), GRPO estimates advantages by comparing a response to other responses for the same question. This eliminates the need for a critic network and simplifies training.
The standard GRPO formulation samples $G$ responses for each question and computes advantages as:

$$A_i = \frac{r_i - \text{mean}\left(\{r_1, \dots, r_G\}\right)}{\text{std}\left(\{r_1, \dots, r_G\}\right)}$$

where $r_i$ is the reward for response $i$, and the mean reward over the group serves as the baseline.

For search agents, this creates a problem: each response involves multiple search queries, making nested sampling extremely expensive. If you sample $N$ questions and $G$ responses per question, you need $N \times G$ complete reasoning traces. With each trace involving multiple search calls, compute costs explode.
HRPO: A Smarter Grouping Strategy
HRPO (Hop-grouped Relative Policy Optimization) introduces a clever insight: instead of grouping responses to the same question, group responses to structurally similar questions.
Questions can be characterized by their "hop count"—how many reasoning steps they require. A 2-hop question like "Who directed the film where Leonardo DiCaprio played a dream thief?" is structurally similar to another 2-hop question like "What year was the director of Titanic born?" Even though the topics differ, the reasoning structure is analogous.
HRPO groups questions by their hop count and computes advantages within these groups:

$$A_i = \frac{r_i - \text{mean}\left(\{r_j : j \in \mathcal{G}_h\}\right)}{\text{std}\left(\{r_j : j \in \mathcal{G}_h\}\right)}$$

where $\mathcal{G}_h$ is the set of all responses to questions with hop count $h$.
This seemingly simple change has profound implications for efficiency.
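A minimal sketch of hop-grouped advantage estimation, assuming one sampled trace per question and a simple dictionary-based grouping (illustrative, not the paper's implementation):

```python
from collections import defaultdict

# One trace per question; the baseline comes from all traces whose questions
# share the same hop count.

def hop_grouped_advantages(traces, eps=1e-6):
    """traces: list of dicts like {"qid": ..., "hops": int, "reward": float}."""
    groups = defaultdict(list)
    for t in traces:
        groups[t["hops"]].append(t["reward"])
    advantages = {}
    for t in traces:
        rewards = groups[t["hops"]]
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
        advantages[t["qid"]] = (t["reward"] - mean) / (std + eps)
    return advantages

# Six different questions, one trace each, grouped by hop count (2 vs. 3):
traces = [{"qid": i, "hops": 2 + (i % 2), "reward": float(i % 3 == 0)}
          for i in range(6)]
print(hop_grouped_advantages(traces))
```

Note that no question is sampled more than once; the group supplies the baseline that per-question sampling would otherwise have to provide.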
Mathematical Formulation
The full HRPO objective can be written in the standard clipped-surrogate form as:

$$\mathcal{J}(\theta) = \mathbb{E}_{q \sim \mathcal{P},\; \tau \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \min\!\left( \frac{\pi_\theta(\tau \mid q)}{\pi_{\theta_{\text{old}}}(\tau \mid q)}\, \hat{A}_{\mathcal{G}(q)},\;\; \text{clip}\!\left( \frac{\pi_\theta(\tau \mid q)}{\pi_{\theta_{\text{old}}}(\tau \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{\mathcal{G}(q)} \right) \right]$$

where:
- $\mathcal{P}$ is the distribution of proposer-generated questions
- $\pi_\theta$ is the current solver policy
- $\pi_{\theta_{\text{old}}}$ is the policy at the start of the update
- $\mathcal{G}(q)$ is the hop group containing question $q$
- $\hat{A}_{\mathcal{G}(q)}$ is the advantage computed using the hop-group baseline

The key insight is that $\mathcal{G}(q)$ contains responses from many different questions, so each question only needs one sampled response. The baseline is computed from the entire hop group rather than from multiple responses to the same question.

This reduces the sampling requirement from $N \times G$ responses to just $N$ responses, a factor of $G$ improvement in sample efficiency.
Why does this work? Questions with the same hop count have similar difficulty and require similar reasoning strategies. A 3-hop question about history is about as hard as a 3-hop question about science, at least on average. Using the hop group mean as a baseline thus provides a reasonable estimate of expected performance, even without sampling multiple responses to each individual question.
Technical Deep Dive: The Mathematics
Policy Gradient Foundation
The entire framework rests on the policy gradient theorem. For a policy $\pi_\theta$ parameterized by $\theta$, the gradient of expected reward is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right]$$

where $\tau$ is a complete trajectory (all reasoning steps and search queries) and $R(\tau)$ is the trajectory reward.

In practice, we use the advantage $\hat{A}(\tau) = R(\tau) - b$ instead of the raw reward to reduce variance:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau)\, \hat{A}(\tau) \right]$$
For the solver, the trajectory is the complete reasoning trace, and the reward is 1 for correct answers, 0 otherwise. For the proposer, the trajectory is the question-generation process, and the reward is the difficulty-guided signal described earlier.
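As a toy illustration of this surrogate in code (using PyTorch; the log-probabilities and rewards are arbitrary values, with the batch-mean reward as the baseline):

```python
import torch

# REINFORCE-with-baseline step: trajectory log-probs weighted by advantages.
log_probs = torch.tensor([-12.3, -9.8, -15.1], requires_grad=True)  # log pi(tau)
rewards = torch.tensor([1.0, 0.0, 1.0])                             # 1 = correct answer
advantages = rewards - rewards.mean()                                # baseline = batch mean

loss = -(log_probs * advantages).mean()   # minimizing this ascends expected reward
loss.backward()
print(log_probs.grad)                     # successful trajectories are pushed up
```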
Advantage Estimation
The HRPO advantage estimation provides a crucial variance reduction over naive policy gradients. Consider two approaches:
Per-Question Baseline (standard GRPO):

$$b(q) = \frac{1}{G} \sum_{j=1}^{G} r_j^{(q)}$$

Hop-Group Baseline (HRPO):

$$b(h) = \frac{1}{|\mathcal{G}_h|} \sum_{j \in \mathcal{G}_h} r_j$$

The GRPO baseline is more accurate (it's conditioned on the specific question) but requires $G$ samples per question. The HRPO baseline is an approximation but only requires one sample per question.
The approximation quality depends on how well hop count predicts difficulty. Empirically, the researchers find that hop count is a strong predictor—variance within hop groups is much lower than variance across the entire dataset.
Reward Shaping for Multi-Hop Questions
For the proposer, the raw reward signal (solver success rate) is sparse and noisy. To improve learning, Dr. Zero incorporates shaped rewards of the form:

$$r_{\text{prop}} = r_{\text{base}} + \lambda_1 \cdot \text{hop\_bonus} + \lambda_2 \cdot \text{format\_bonus} + \lambda_3 \cdot \text{search\_usage}$$

where:
- hop_bonus: Extra reward for generating questions with more hops (encourages difficulty)
- format_bonus: Reward for well-formed questions that include proper entity references
- search_usage: Reward for actually using the search engine during question generation (prevents degenerate solutions where the proposer generates questions from parametric memory only)
These shaping terms guide the proposer toward desirable behaviors without changing the ultimate objective. The base reward still dominates—shaping just provides helpful gradients early in training when the base signal is too sparse.
Experimental Setup
Models and Training
Dr. Zero experiments use Qwen2.5-Instruct models as the base for both proposer and solver, with experiments conducted at 3B and 7B parameter scales. This choice balances capability with computational tractability—these models are small enough for rapid iteration but large enough to exhibit meaningful reasoning behavior.
Training proceeds in multiple iterations:
- Iteration 1: Initialize both proposer and solver from the base model
- Iteration 2+: Each iteration uses the previous iteration's trained models as starting points
Each iteration involves:
- Generate ~10,000 questions using the current proposer
- Train the solver on these questions for several epochs
- Update the proposer based on solver performance
- Evaluate on held-out benchmarks
The total compute is substantial but tractable—roughly 50-100 GPU-hours per iteration on modern hardware.
Benchmarks
Evaluation uses a comprehensive suite of question answering benchmarks spanning both single-hop and multi-hop reasoning:
Single-Hop Benchmarks:
- Natural Questions (NQ): Real questions from Google search, testing factual retrieval
- TriviaQA: Trivia questions with evidence from Wikipedia and the web
- PopQA: Questions about popular entities, testing common knowledge retrieval
Multi-Hop Benchmarks:
- HotpotQA: Multi-hop questions requiring reasoning across Wikipedia paragraphs. Questions typically require 2-3 hops.
- 2WikiMultiHopQA: More challenging multi-hop questions across two Wikipedia articles. Requires explicit comparison and reasoning.
- MuSiQue: Multi-hop questions with decomposed sub-questions, allowing analysis of which reasoning steps fail.
- Bamboogle: Adversarially constructed questions designed to fool models that rely on shortcuts. Tests genuine multi-hop reasoning.
These benchmarks cover a range of difficulties and reasoning patterns, providing a comprehensive evaluation of search agent capability.
Baselines
Dr. Zero is compared against several baselines:
Base LLM: The original Qwen2.5-Instruct model (3B or 7B) without any search agent training. This establishes the starting capability.
Supervised RL: Agents trained with reinforcement learning on human-curated questions from HotpotQA. This represents the traditional approach with full human supervision.
Prior Self-Evolution: Existing data-free self-evolution methods adapted for search agents. These provide the most direct comparison to Dr. Zero.
Frontier Models: GPT-4 and Claude evaluations (where available) provide reference points for state-of-the-art capability.
Results and Analysis
Main Results
The headline result is striking: Dr. Zero matches or surpasses fully supervised search agents without using any training data.
The experiments evaluate Dr. Zero against a comprehensive set of baselines across multiple benchmarks. The baselines include simple prompting, IRCoT (interleaved retrieval chain-of-thought), Search-o1, standard RAG, supervised fine-tuning (SFT), R1-Instruct, and Search-R1—the latter being a strong supervised RL baseline.
On Qwen2.5-3B-Instruct:
- Dr. Zero achieves 0.326 average accuracy across all benchmarks
- Outperforms all baselines on single-hop benchmarks (NQ: 0.397, TriviaQA: 0.572, PopQA: 0.431)
- Competitive with Search-R1 on multi-hop benchmarks while using zero training data
On Qwen2.5-7B-Instruct:
- Dr. Zero achieves 0.372 average accuracy, nearly matching Search-R1 (0.384)
- Best performance on 2WikiMQA (0.347) among all methods
- Strong multi-hop reasoning: HotpotQA (0.362), MuSiQue (0.104)
The pattern is consistent: Dr. Zero's data-free self-evolution approaches or surpasses methods that use human-curated training data. This challenges the assumption that human supervision is necessary for complex reasoning tasks.
Dr. Zero performance comparison across benchmarks with Qwen2.5 models
Ablation Studies
To understand which components drive Dr. Zero's success, the researchers conduct systematic ablations:
Proposer Multi-Turn Rollout: Removing the search engine from proposer training (generating questions from parametric memory only) reduces performance by 12-15% across benchmarks. The proposer generates less diverse and less factually grounded questions.
Difficulty-Guided Reward: Using a simpler reward (just solver accuracy, without the difficulty shaping) causes the proposer to collapse to easy questions. Performance drops 8-10%.
HRPO vs. Standard GRPO: With the same compute budget, HRPO achieves 4-6% higher accuracy than GRPO. Alternatively, HRPO achieves the same accuracy with 3-4x less compute.
Number of Training Iterations: Performance improves consistently across iterations 1, 2, and 3. Iteration 4 shows diminishing returns, suggesting the system approaches a capability ceiling at the 7B model scale.
These ablations confirm that each component of Dr. Zero contributes meaningfully to its success.
Scaling Properties
An important question for any training method: does it scale? The researchers investigate two scaling dimensions:
Scaling with Compute: Within each iteration, more training compute (more questions, more gradient updates) yields better performance, though with diminishing returns. The optimal allocation is roughly 70% of compute to solver training, 30% to proposer training.
Scaling with Iterations: Performance improves consistently across iterations 1-3. The automated curriculum successfully increases difficulty over time—average question hop count increases from 1.8 in iteration 1 to 3.2 in iteration 3.
Scaling with Model Size: Preliminary experiments with larger models (70B) show continued improvement, suggesting the method scales favorably. However, full scaling law characterization awaits future work.
Implications: The Future of Self-Evolving AI
Beyond Search Agents
While Dr. Zero focuses on search agents, the principles generalize. The core insight—using external tools as both task environment and supervision signal—applies wherever verifiable external feedback exists:
Code Generation: The code interpreter can verify program correctness, enabling self-evolution for coding agents.
Mathematical Reasoning: Symbolic math engines can verify solutions, as prior work has demonstrated.
Scientific Research: Simulation environments and databases can provide feedback for scientific reasoning agents.
Embodied Agents: Physical simulators can supervise robot learning through task success/failure signals.
The pattern is consistent: wherever there's a reliable external oracle, data-free self-evolution becomes possible.
Limitations and Open Questions
Despite its success, Dr. Zero has important limitations:
Requires External Oracle: The framework depends on a search engine that returns accurate information. For tasks without reliable external verification, the approach doesn't directly apply.
Ceiling Effects: After several iterations, improvement plateaus. The model may be hitting fundamental capability limits, or the self-evolution curriculum may be insufficiently diverse.
Efficiency: While HRPO reduces compute compared to standard GRPO, the total compute remains substantial. Each iteration requires thousands of search queries and GPU-hours of training.
Quality Control: Self-generated questions may contain errors or biases that propagate through training. The framework assumes the proposer generates high-quality questions, but this isn't guaranteed.
Open research questions include:
- Can self-evolution continue beyond current ceilings with architectural changes?
- How do we ensure self-generated training data is high quality and unbiased?
- Can multiple self-evolving agents collaborate to push each other further?
- What are the theoretical limits of data-free self-evolution?
The Bigger Picture
Dr. Zero represents a significant step toward AI systems that can improve themselves without human supervision. This has profound implications:
Reduced Annotation Costs: If self-evolution can match supervised learning, the economics of AI development shift dramatically. Teams no longer need massive annotation budgets for every new capability.
Faster Iteration: Self-evolution can run continuously without waiting for human data collection. Development cycles could accelerate from months to days.
Novel Capabilities: Self-evolution might discover reasoning strategies that humans wouldn't think to demonstrate. The model explores the space of possible strategies rather than imitating human examples.
Safety Considerations: Self-improving AI raises important safety questions. If models can enhance their own capabilities, how do we ensure they remain aligned with human values? Dr. Zero's search agents are narrowly focused, but the general principle of self-improvement warrants careful consideration.
References
- Yue, Z., Upasani, K., Yang, X., Ge, S., Nie, S., Mao, Y., Liu, Z., & Wang, D. (2025). Dr. Zero: Self-Evolving Search Agents without Training Data. Meta Superintelligence Labs / UIUC. arXiv:2601.07055
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., ... & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300
- Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., & Han, J. (2022). Large Language Models Can Self-Improve. arXiv:2210.11610
- Zelikman, E., Wu, Y., Mu, J., & Goodman, N. D. (2022). STaR: Bootstrapping Reasoning with Reasoning. NeurIPS. arXiv:2203.14465
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. arXiv:2305.18290
- Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP. arXiv:1809.09600
- Ho, X., Nguyen, A. K. D., Sugawara, S., & Aizawa, A. (2020). Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. COLING. arXiv:2011.01060
- Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., & Lewis, M. (2022). Measuring and Narrowing the Compositionality Gap in Language Models. arXiv:2210.03350
- Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2022). MuSiQue: Multihop Questions via Single-hop Question Composition. TACL. arXiv:2108.00573