Mirror of https://github.com/InternLM/InternBootcamp.git (synced 2026-04-19 12:58:04 +00:00)

Commit f231ec6fa0 (parent a8249acc18): Update Readme, release v1.0 (#12)

7 changed files with 288 additions and 235 deletions
README.md (261 changes)
# InternBootcamp

<p align="center">
<a href="https://arxiv.org/pdf/2508.08636">📄 Paper</a> •
<a href="https://github.com/InternLM/InternBootcamp">⭐ Github</a> •
<a href="examples/data/Intenbootcamp_eval">📊 Evaluation</a> •
<a href="https://intern.openxlab.org.cn/internthinker/go-game">⚪ Internthinker-Go</a>
</p>

<p align="center">
🌍 <a href="./README.md">English</a> | <a href="./README_zh.md">简体中文</a>
</p>

<p align="center">
<img src="./figs/logo.png" width=400/>
</p>

InternBootcamp is an open-source framework comprising **1000+ domain-diverse task environments** specifically designed for LLM reasoning research. By integrating automated generation of unlimited training/testing cases with configurable difficulty levels and integrated verification modules, InternBootcamp serves as fundamental infrastructure for **RL-based model optimization**, **synthetic data generation**, and **model evaluation**.

Our key innovation lies in demonstrating that scaling the number of verifiable reasoning tasks during training significantly enhances both reasoning performance and training efficiency—a phenomenon we term **"Task Scaling"** 📈. Currently, InternBootcamp includes verifiable reasoning tasks across 8 diverse domains, covering problems related to algorithms, cryptography, natural science, language analysis, mathematical modeling, graphical puzzles, logical reasoning, and character puzzles. We are continuing efforts to expand its scope with the community.

## 🚀 Getting Started

Quickly get started with data generation, reinforcement learning training, model evaluation, and custom Bootcamp creation!

- [Installation](#installation)
- [Quick Start](examples/get_started.md)
- [Interfaces and Usages](#usage)
- [Bootcamp-Eval Dataset](examples/data/Intenbootcamp_eval)

## 📢 Update

- 📦 **[2025/08] v1.0 released!**
- 📄 **[2025/08] [InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling](https://arxiv.org/pdf/2508.08636) released.**
- 🌱 **[2025/04] v0.1 released.**

## 🧩 About

Large-scale reinforcement learning has been demonstrated to be an effective way towards expert-level reasoning models. Most current efforts to advance this technical routine concentrate on limited tasks, such as math, and focus on devising improved training algorithms. Complementarily, we believe the investigation into **Task Scaling**—exposing models to a wide and growing spectrum of reasoning tasks—is essential for building general and robust reasoning models:

- Including more types of tasks covers diverse reasoning patterns, leading to more general intelligence;
- Studying tasks with controllable difficulties and their combinations facilitates understanding of training dynamics and enables more efficient training strategies.

Despite the abundance of potentially valuable tasks available, their dispersed distribution across various sources makes it exceptionally difficult for practitioners to utilize them. To this end, we introduce InternBootcamp to facilitate related investigations and provide engineering convenience. In particular, we would like to highlight the following features of InternBootcamp:

- **🔧 Standardized:** InternBootcamp provides a unified interface for various tasks and is easy to integrate with different codebases for reinforcement learning or synthetic data. Each bootcamp class exposes parameters to control task difficulty and implements standardized methods to generate questions and verify solutions, allowing seamless integration with reinforcement learning or synthetic data pipelines.

- **📊 Scalable:** Thanks to an automatic agent workflow for bootcamp synthesis, InternBootcamp has grown to include a large volume of diverse bootcamp tasks. In the first release, it covers over 1000 complex reasoning tasks across 8 domains, including games, logic problems, puzzles, algorithms, scientific reasoning, and more. Over 90% of these bootcamps were developed through an automated synthesis and quality filtration pipeline, enabling continuous scaling of bootcamp environments with minimal human intervention.

- **🧱 Extensible:** InternBootcamp can be extended to support more diverse and complicated tasks (e.g., tasks with multi-turn interaction like Go and agent-based environments) and provide them with question generation and results verification. Representatively, we include `InternGObootcamp` as a demonstration.

We also conduct a series of investigations into reinforcement learning using InternBootcamp. Our preliminary findings are as follows:

- **Scalable task synthesis enables broad experiential learning**: Our automated agent workflow demonstrates that large-scale, diverse reasoning environments can be effectively synthesized via iterative, evolutionary methods, opening the door to training agents on a continuous stream of novel tasks.

- **Generalization emerges from cross-task exposure**: LLMs develop stronger reasoning generalization and emergent abilities not through deep specialization in narrow domains, but by learning across a wide spectrum of reasoning tasks.

- **Task scaling improves both performance and efficiency**: Increasing the number of training tasks significantly boosts both final performance and learning efficiency, with a near-linear relationship between task quantity and reasoning capability.

- **InternThinker-GO**: As a representative of single-task training, we train `InternThinker-GO` with the InternGObootcamp. `InternThinker-GO` approaches professional players using far fewer games than AlphaGO, surpassing current reasoning models. Besides excellent performance, `InternThinker-GO` provides reasonable and inspiring thoughts, demonstrating the great potential of human-like reasoning empowered by reinforcement learning in tackling expert-level tasks.

## 🎯 Supported Bootcamps

<p align="center">
<img src="./figs/bootcamps.png" width="1000"/>
</p>

In the first release, InternBootcamp has covered bootcamps for **1000+ tasks**, coming from:

- **🧠 Benchmarks for reasoning:** Currently, we have included tasks from [ARC-AGI](https://github.com/fchollet/ARC-AGI), [re-arc](https://github.com/michaelhodel/re-arc), [KOR-Bench](https://kor-bench.github.io/), and [BBEH](https://github.com/google-deepmind/bbeh), representative reasoning benchmarks, to build bootcamps. KOR-Bench includes five types of reasoning tasks, namely logic, operation, cipher, puzzle, and counterfactual reasoning; we omit counterfactual reasoning for its dependence on specific world-view knowledge and build bootcamps for the remaining four types. BBEH comprises 23 reasoning tasks obtained by complicating tasks from BBH, and we build bootcamps for the tasks that do not depend on external knowledge.

- **🧩 Puzzle websites:** [puzzle-xxx](https://www.puzzle-aquarium.com/) is a series of puzzle webpages; we scrape 39 puzzles from it to prepare corresponding bootcamps.

- **⚙️ Algorithm problems:** Algorithm problems cover reasoning patterns in various algorithms, and they contain questions that are close to real-world applications. Meanwhile, algorithm problems in the wild usually come with reference solutions, making them easy to convert into bootcamps. Currently, we use [CodeContests](https://huggingface.co/datasets/deepmind/code_contests), select 1265 tasks of medium difficulty (Codeforces points between 1000 and 2000), and apply our automatic workflow to construct corresponding bootcamps. Additionally, we adapted tasks from [CodeIO](https://codei-o.github.io/), which translates code-based reasoning into natural language to assess large language models' reasoning capabilities.

- **💻 Benchmarks for programming capability:** Currently, we have included tasks from [BigCodeBench](https://bigcode-bench.github.io/) and [KodCode](https://kodcode-ai.github.io/), two representative programming benchmarks, to build bootcamps. These benchmarks feature diverse and challenging problems that require language models to generate correct code. For each task, we collected or adapted a `unittest` script to validate solution correctness.

- **📋 Instruction following:** These tasks test a model's ability to comprehend and strictly adhere to instructions embedded in task descriptions. In many cases, correctness can be evaluated through code execution feedback. We included tasks from [AutoIF](https://github.com/QwenLM/AutoIF), which contains over 60,000 instruction–evaluation function pairs, each treated as an individual task.

- **🎮 Games:** Games are complex reasoning tasks involving multi-turn interactions with controllable and verifiable objectives. As a representative, we built `InternGObootcamp` to train a reasoning model for Go.

- **🔬 Scientific tasks:** Scientific tasks represent a spectrum of reasoning-intensive endeavors deeply intertwined with scientific research, regarded as among the most valuable domains for AI to revolutionize. We believe improving reasoning models on these tasks facilitates the achievement of this vision. Part of the scientific task collection is supported by the [Intern-S1](https://arxiv.org/pdf/2508.15763) team, and in return, InternBootcamp also provides training support for Intern-S1.

We are continuing our efforts and call for the community to verify the automatically generated bootcamps. We present the full list of bootcamps ([the full bootcamp list](./Fulllist_InternBootcamp.md)) and illustrate our automatic workflow below.

## 🤖 Automatic Agent Workflow for Large-Scale Bootcamp Synthesis

<p align="center">
<img src="./figs/workflow.png" width="1000"/>
</p>

Manually coding bootcamps for each task is inefficient and not scalable. We introduce an **automatic agent workflow** that leverages large language models to generate bootcamp code from task descriptions. This pipeline involves:

1. **📥 Task Description Collection:** Identify verifiable tasks (puzzles, reasoning benchmarks, algorithm problems, etc.) and collect their descriptions and supporting information.

2. **🔄 Evolutionary Code Generation:** Use strong coding models (e.g., Deepseek-R1) to generate bootcamp code iteratively, incorporating execution feedback to avoid oversimplification and errors.

3. **✅ Self-Consistent Unittest Filtering:** Filter bootcamps by evaluating LLM responses using the `verify_function` as unittests. Bootcamps with accuracy outside [0.03, 0.85] are filtered out.

This workflow has enabled rapid expansion to 1000+ bootcamps with high quality and diversity.
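The accuracy-band criterion in step 3 can be sketched as follows. This is an illustrative assumption based on the description above, not the repository's actual implementation; `keep_bootcamp` and its inputs are invented for the sketch:

```python
def keep_bootcamp(scores, low=0.03, high=0.85):
    """Illustrative filter: keep a synthesized bootcamp only if a reference
    LLM's accuracy on sampled questions falls inside [low, high].

    Accuracy near 0 suggests a broken or unsolvable environment;
    accuracy near 1 suggests an overly simple one.
    """
    if not scores:
        return False  # no evidence the environment even runs
    accuracy = sum(scores) / len(scores)
    return low <= accuracy <= high

# A bootcamp the model solves 40% of the time is kept; one it always
# (or never) solves is filtered out.
print(keep_bootcamp([1, 0, 1, 0, 0]))  # → True
```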
## 🛠 Interfaces and Usages

Each bootcamp inherits from `BaseBootcamp` and implements three main interfaces: `case_generator`, `prompt_func`, and `verify_score`, which serve question generation and result verification.

<p align="center">
<img src="./figs/bootcamp-framework.png" width="1000"/>
</p>

### Installation <a id="installation"></a>

```bash
git clone https://github.com/InternLM/InternBootcamp.git
cd InternBootcamp
pip install -e .
```

### Example: Game24Bootcamp <a id="usage"></a>

We introduce bootcamp usage with `Game24Bootcamp` as an example; other supported bootcamps can be found in [the full bootcamp list](./Fulllist_InternBootcamp.md).

Game24 is an arithmetic puzzle where you use `num_numbers` numbers (each ≤ `range_max`) and basic operations to obtain the `target` value (≤ `target_max`).

#### Generating Questions

First, instantiate the bootcamp; then use the `case_generator` method to generate a problem instance, and `prompt_func` to convert the instance into a natural-language question:

```python
from internbootcamp import Game24Bootcamp

# Specify difficulty parameters
bootcamp = Game24Bootcamp(num_numbers=4, range_max=100, target_max=100, seed=42)

# Or use the default configuration
# bootcamp_default = Game24Bootcamp()

identity = bootcamp.case_generator()
prompt = bootcamp.prompt_func(identity)

# Example Output:
# - identity: {'puzzle': '8 43 65 77', 'target': 28}
# - prompt: "Please solve the puzzle: using 8 43 65 77 to obtain 28 through basic arithmetics..."
```


#### Verifying Results

After obtaining the response to the question, use the `verify_score` method to score the result. By default, a correct result scores 1 and an incorrect one scores 0. You can additionally specify `format_score` (default 0), the score returned when only the response format is correct.

```python
response = "...some reasoning process...\\boxed{77 / (65 - 43) * 8}"
score = Game24Bootcamp.verify_score(response, identity, format_score=0.1)
```

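For intuition, a Game24-style verifier must extract the final `\boxed{...}` expression, confirm it uses exactly the given numbers, and evaluate it against the target. A minimal self-contained sketch of such a check (not the library's actual `verify_score` code; `check_game24_answer` is invented for illustration):

```python
import re

def check_game24_answer(response, puzzle, target):
    """Illustrative check: take the last \\boxed{...} expression,
    confirm it uses exactly the puzzle's numbers, and evaluate it."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return False  # no parsable final answer
    expr = matches[-1]
    # The multiset of numbers in the expression must match the puzzle.
    if sorted(re.findall(r"\d+", expr)) != sorted(puzzle.split()):
        return False
    try:
        value = eval(expr)  # trusted, illustrative use only
    except Exception:
        return False
    return abs(value - target) < 1e-6

response = "...some reasoning...\\boxed{77 / (65 - 43) * 8}"
print(check_game24_answer(response, "8 43 65 77", 28))  # → True
```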
### Extending to More Tasks

New bootcamps can be created by inheriting from `BaseBootcamp` and implementing its core interfaces. See [examples/README.md](examples/README.md) for details.

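To make the subclassing concrete, here is a minimal hypothetical bootcamp that mirrors the three-method interface shown in the Game24 example. The task (`ModAddBootcamp`, modular addition) and every name in it are invented for illustration; the real base class and registration details live in the library:

```python
import random
import re

class ModAddBootcamp:
    """Hypothetical bootcamp: answer (a + b) mod m. In the real library
    this would inherit from the BaseBootcamp class."""

    def __init__(self, max_value=100, seed=None):
        self.max_value = max_value        # difficulty knob
        self.rng = random.Random(seed)    # seeded for reproducible cases

    def case_generator(self):
        a = self.rng.randint(0, self.max_value)
        b = self.rng.randint(0, self.max_value)
        return {"a": a, "b": b, "mod": 7}

    def prompt_func(self, identity):
        return (f"Compute ({identity['a']} + {identity['b']}) mod {identity['mod']}. "
                "Output the final answer within \\boxed{}.")

    @staticmethod
    def verify_score(response, identity, format_score=0):
        m = re.search(r"\\boxed\{(-?\d+)\}", response)
        if not m:
            return 0                      # unparsable: no format credit
        answer = int(m.group(1))
        expected = (identity["a"] + identity["b"]) % identity["mod"]
        return 1 if answer == expected else format_score

bc = ModAddBootcamp(max_value=50, seed=7)
case = bc.case_generator()
print(bc.prompt_func(case))
```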
### Reinforcement Learning

InternBootcamp integrates easily with mainstream reinforcement learning and synthetic data frameworks. See [examples/README.md](examples/README.md) for details.

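As a sketch of that integration, a bootcamp can be wrapped as a reward function for an RL rollout. Only the `case_generator`/`prompt_func`/`verify_score` interface comes from this README; the glue function and the stub bootcamp below are hypothetical:

```python
def bootcamp_rollout(bootcamp, policy_fn):
    """Illustrative glue: generate one prompt, query the policy,
    and score the response as the RL reward."""
    identity = bootcamp.case_generator()
    prompt = bootcamp.prompt_func(identity)
    response = policy_fn(prompt)
    reward = bootcamp.verify_score(response, identity)
    return prompt, response, reward

# Stub bootcamp and policy, for illustration only.
class EchoBootcamp:
    def case_generator(self):
        return {"answer": "42"}
    def prompt_func(self, identity):
        return "Reply with \\boxed{42}."
    def verify_score(self, response, identity, format_score=0):
        return 1 if "\\boxed{%s}" % identity["answer"] in response else format_score

prompt, response, reward = bootcamp_rollout(EchoBootcamp(), lambda p: "\\boxed{42}")
print(reward)  # → 1
```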
## 🧪 Experiments: Boosting LLM Reasoning with Verifiable Task Scaling

We conduct extensive experiments to investigate how **Task Scaling**—training with an increasing number and diversity of reasoning tasks—enhances the reasoning capabilities of large language models. Our findings demonstrate that Task Scaling not only improves final performance but also significantly boosts training efficiency.

<p align="center">
<img src="./figs/overall.png" width="800"/>
</p>

Through systematic scaling of training tasks, we observe consistent improvements in model performance across diverse reasoning domains. Models trained with more tasks achieve better generalization and higher accuracy on our Bootcamp-Eval benchmark, showcasing the effectiveness of Task Scaling in developing versatile reasoning models. Besides, scaling the number of training tasks also enhances training efficiency in RLVR process.
|
||||||
|
|
||||||
|
Additionally, we discover that multi-task training enables an **Emergent Moment**, where tasks that are unsolvable in isolation suddenly become learnable when trained together with other tasks. This phenomenon demonstrates that cross-task knowledge transfer fosters latent generalization capabilities, allowing models to tackle complex challenges that would otherwise remain out of reach.

<p align="center">
<img src="./figs/emergent.png" width="800"/>
</p>
📌 For detailed experimental results and comprehensive analysis, please refer to our [technical report](https://arxiv.org/pdf/2508.08636) 📝.


## ⚫ DEMO: [InternThinker-GO](https://intern.openxlab.org.cn/internthinker/go-game)

LLMs have demonstrated remarkable performance across a wide range of common reasoning tasks. However, as one of the earliest research problems that ignited the AI boom, the reasoning capability of general-purpose LLMs in the specific domain of **Go** has received little research attention. While AlphaZero challenged human intelligence in Go from the perspective of "Mastering the Game of Go without Human Knowledge," we explore how to bring human intelligence back to this ancient game, letting the natural-language thinking patterns unique to humans shine again in the new context of LLMs. Based on InternBootcamp, we implemented a Go bootcamp for reinforcement learning of reasoning models, cold-started it with professional Go domain data, and reinforced the model's reasoning paradigm through reinforcement learning. Our model achieves performance comparable to professional Go players: InternThinker-GO consistently defeats Golaxy AI at the amateur 6-dan level and approaches the professional 1-star level, making it the first general large language model to reach this level of play.

<p align="center">
<img src="figs/demo_internthinkerGO.png" width="1000"/>
</p>

For a given state, InternThinker-GO first analyzes the situation on the board: *"There is a complex battle area in the upper right corner, where both Black and White have multiple stones. The lower left corner has some interlaced Black and White stones, forming a certain structure. Black has a formation along the left edge. Black just played move 65 at B14, which is clearly intended to invade White's territory on the left side. As White, I need to respond carefully to this threat."* Next, InternThinker-GO predicts and analyzes potential moves such as B13 and C13, and ultimately selects B13 as the move to play.
## 🙏 Acknowledgements

We acknowledge the following excellent projects, whose work has provided significant inspiration and tooling for our efforts:

- [Intern-S1](https://github.com/InternLM/Intern-S1)
- [VeRL](https://github.com/volcengine/verl)
- [Xtuner](https://github.com/InternLM/xtuner)
- [OpenCompass](https://github.com/open-compass/opencompass)
## 📜 Citation

If you find our work helpful, please cite:

```bibtex
@misc{li2025internbootcamptechnicalreportboosting,
      title={InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling},
      author={Peiji Li and Jiasheng Ye and Yongkang Chen and Yichuan Ma and Zijie Yu and Kedi Chen and Ganqu Cui and Haozhan Li and Jiacheng Chen and Chengqi Lyu and Wenwei Zhang and Linyang Li and Qipeng Guo and Dahua Lin and Bowen Zhou and Kai Chen},
      year={2025},
      eprint={2508.08636},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.08636},
}
```
262
README_zh.md
# InternBootcamp

<p align="center">
  <a href="https://arxiv.org/pdf/2508.08636">📄 论文</a> •
  <a href="https://github.com/InternLM/InternBootcamp">⭐ Github</a> •
  <a href="examples/data/Intenbootcamp_eval">📊 评测数据</a> •
  <a href="https://intern.openxlab.org.cn/internthinker/go-game">⚪ Internthinker-Go</a>
</p>

<p align="center">
🌍 <a href="./README.md">English</a> | <a href="./README_zh.md">简体中文</a>
</p>

<p align="center">
<img src="./figs/logo.png" width=400/>
</p>
InternBootcamp 是一个开源框架,包含 **1000+ 个领域多样的任务环境**,专为大语言模型(LLM)推理研究而设计。通过集成可配置难度级别的无限训练/测试数据自动生成以及内置验证模块,InternBootcamp 可作为**基于强化学习的模型优化**、**合成数据生成**和**模型评估**的基础设施。

我们的核心创新在于证明了:在训练过程中扩展可验证推理任务的数量,能显著提升推理性能和训练效率,我们称这一现象为 **“任务缩放”(Task Scaling)** 📈。目前,InternBootcamp 涵盖了 8 个不同领域的可验证推理任务,包括算法、密码学、自然科学、语言分析、数学建模、图形谜题、逻辑推理和字符谜题等相关问题。我们正持续努力并与社区共同扩展其范围。

## 🚀 快速开始

快速开始数据生成、强化学习训练、模型评估以及自定义 Bootcamp 的创建!

- [安装](#安装)
- [快速开始](examples/get_started.md)
- [接口与用法](#接口与用法)
- [Bootcamp-Eval 数据集](#examples/data/Intenbootcamp_eval)

## 📢 更新日志

- 📦 **[2025/08] v1.0 版本发布!**
- 📄 **[2025/08] [InternBootcamp 技术报告:通过可验证的任务缩放提升 LLM 推理能力](https://arxiv.org/pdf/2508.08636) 发布。**
- 🌱 **[2025/04] v0.1 版本发布。**
## 🧩 关于

大规模强化学习已被证明是通向专家级推理模型的有效途径。当前推进这一技术路线的大多数努力都集中在有限的任务上(例如数学),并专注于设计改进的训练算法。作为补充,我们认为对**任务缩放**(让模型接触广泛且不断增长的推理任务谱系)的研究对于构建通用且鲁棒的推理模型至关重要:

- 包含更多类型的任务可以覆盖多样的推理模式,从而带来更通用的智能;
- 研究具有可控难度及组合方式的任务有助于理解训练动态,并实现更高效的训练策略。

尽管存在大量潜在有价值的任务,但它们分散在不同的来源中,实践者难以统一利用。为此,我们推出 InternBootcamp,以促进相关研究并提供工程便利。InternBootcamp 具有以下特点:
- **🔧 标准化:** InternBootcamp 为各类任务提供统一接口,易于与不同代码库集成以进行强化学习或合成数据生成。每个 bootcamp 类都实现了标准化的问题生成与解答验证方法,可与强化学习或合成数据管道无缝集成。

- **📊 自动化:** 得益于用于 bootcamp 合成的自动智能体工作流,InternBootcamp 已包含大量多样的 bootcamp 任务。在首个版本中,它涵盖 8 个领域的 1000 多个复杂推理任务,包括游戏、逻辑问题、谜题、算法、科学推理等。其中超过 90% 的 bootcamp 通过自动合成和质量过滤流水线开发,能够以最少的人工干预持续扩展 bootcamp 环境。此外,InternBootcamp 能够自动生成覆盖各类任务、难度可控的指令数据。

- **🧱 可扩展:** InternBootcamp 可以扩展以支持更多样、更复杂的任务(例如围棋、智能体环境等多轮交互任务),并为其提供问题生成和结果验证。我们包含了 `InternGObootcamp` 作为代表性演示。

我们还使用 InternBootcamp 进行了一系列强化学习研究,初步发现如下:

- **可扩展的任务合成实现了广泛的经验学习**:我们的自动智能体工作流表明,通过迭代、进化的方法可以有效合成大规模、多样化的推理环境,为在持续的新任务流上训练智能体打开了大门。
- **泛化能力源于跨任务接触**:LLM 通过在多样的推理任务谱系中学习,而非在狭窄领域深度专精,从而发展出更强的推理泛化和涌现能力。
- **任务缩放同时提高性能和效率**:增加训练任务的数量显著提高了最终性能和学习效率,任务数量与推理能力之间存在近乎线性的关系。
- **InternThinker-GO**:作为单任务训练的代表,我们使用 InternGObootcamp 训练了 `InternThinker-GO`。它以远少于 AlphaGO 的对局数接近职业棋手水平,超越了当前的推理模型。除了优异的性能,`InternThinker-GO` 还能给出合理且富有启发性的思考,展示了由强化学习赋能的人类式推理在应对专家级任务方面的巨大潜力。
## 🎯 支持的 Bootcamps

<p align="center">
<img src="./figs/bootcamps.png" width="1000"/>
</p>

在首个版本中,InternBootcamp 已覆盖 **1000+ 个任务** 的 bootcamps,来源包括:

- **🧠 推理基准测试:** 目前,我们从 [ARC-AGI](https://github.com/fchollet/ARC-AGI)(实现参考 [re-arc](https://github.com/michaelhodel/re-arc))、[KOR-Bench](https://kor-bench.github.io/) 和 [BBEH](https://github.com/google-deepmind/bbeh) 等代表性推理基准中选取任务构建 bootcamps。其中,KOR-Bench 包含逻辑、操作、密码、谜题和反事实推理五类推理任务,我们略去了依赖特定世界观知识的反事实推理,为其余四类任务构建了 bootcamps。BBEH 是通过复杂化 BBH 中的任务得到的 23 个推理任务,我们为其中不依赖外部知识的任务构建了 bootcamps。

- **🧩 谜题网站:** [puzzle-xxx](https://www.puzzle-aquarium.com/) 是一系列谜题网站;我们抓取了其中 39 个谜题来构建相应的 bootcamps。

- **⚙️ 算法问题:** 算法问题涵盖了各种算法中的推理模式,并且包含接近实际应用的问题;同时,现成的算法问题通常带有参考解答,便于转换为 bootcamps。目前,我们使用 [CodeContests](https://huggingface.co/datasets/deepmind/code_contests) 并选择了 1265 个中等难度(codeforces 分数在 1000 到 2000 之间)的任务,应用我们的自动工作流构建相应的 bootcamps。此外,我们还改编了来自 [CodeIO](https://codei-o.github.io/) 的任务,它将基于代码的推理转化为自然语言,以评估大语言模型的推理能力。

- **💻 编程能力基准测试:** 目前,我们从 [BigCodeBench](https://bigcode-bench.github.io/) 和 [KodCode](https://kodcode-ai.github.io/) 这两个代表性编程基准中选取任务构建 bootcamps。这些基准包含多样且具有挑战性的问题,要求语言模型生成正确的代码。对于每个任务,我们收集或改编了一个 `unittest` 脚本来验证解答的正确性。

- **📋 指令遵循:** 这类任务测试模型理解并严格遵守任务描述中指令的能力;在许多情况下,可以通过代码执行反馈来评估正确性。我们包含了来自 [AutoIF](https://github.com/QwenLM/AutoIF) 的任务,它包含超过 60,000 个指令-评估函数对,每个都被视为一个独立的任务。

- **🎮 游戏:** 游戏是一类涉及多轮交互、目标可控且可验证的复杂推理任务。作为代表,我们构建了 `InternGObootcamp` 来训练围棋推理模型。

- **🔬 科学任务:** 科学任务代表了一系列与科学研究活动深度交织的推理密集型工作,被视为人工智能将带来变革的最有价值领域之一。我们认为提升模型在这些任务上的推理能力有助于实现这一愿景。部分科学任务集的构建得到了 [Intern-S1](https://arxiv.org/pdf/2508.15763) 团队的支持;作为回报,InternBootcamp 也为 Intern-S1 提供训练支持。

我们正在持续努力,并呼吁社区共同验证自动生成的 bootcamps。完整列表见 [完整 bootcamp 列表](./Fulllist_InternBootcamp.md),自动工作流说明如下。
## 🤖 大规模 Bootcamp 合成的自动智能体工作流

<p align="center">
<img src="./figs/workflow.png" width="1000"/>
</p>

为每个任务手动编写 bootcamp 代码效率低下且不可扩展。我们引入了一个**自动智能体工作流**,利用大语言模型根据任务描述生成 bootcamp 代码。该流水线包括:

1. **📥 任务描述收集:** 识别可验证的任务(谜题、推理基准、算法问题等)并收集它们的描述和支持信息。
2. **🔄 进化式代码生成:** 使用强大的代码模型(例如 Deepseek-R1)迭代生成 bootcamp 代码,并结合执行反馈以避免过度简化和错误实现。
3. **✅ 自洽单元测试过滤:** 利用 bootcamp 自身的验证器评估语言模型的回复,以能否正确执行以及模型回复的通过率作为单元测试;准确率超出 [0.03, 0.85] 范围的 bootcamps 将被过滤掉。

该工作流已支撑快速扩展到 1000+ 个高质量、多样化的 bootcamps。
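上述第 3 步的通过率过滤规则可以用如下代码示意(仅为本文的假设性草图,`keep_bootcamp`、`pass_rates` 等命名并非库中真实接口):

```python
def keep_bootcamp(pass_rates, lo=0.03, hi=0.85):
    """自洽单元测试过滤的示意。

    pass_rates:模型在该 bootcamp 上多次采样回复的通过率列表;
    平均通过率落在 [lo, hi] 之外的 bootcamp 被过滤掉。
    """
    if not pass_rates:  # 无法运行、采不到有效回复:直接过滤
        return False
    acc = sum(pass_rates) / len(pass_rates)
    # 过易(准确率接近 1,说明任务可能被过度简化)
    # 或过难(准确率接近 0,说明可能存在实现错误)都排除
    return lo <= acc <= hi
```

这样,保留下来的 bootcamp 既可被正确执行,又对当前模型保持适中的难度。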
## 🛠 接口与用法

每个 bootcamp 继承自 `BaseBootcamp`,并包含三个主要接口:`case_generator`、`prompt_func` 和 `verify_score`,用于问题生成和结果验证。

<p align="center">
<img src="./figs/bootcamp-framework.png" width="1000"/>
</p>

### 安装 <a id="安装"></a>

```bash
git clone https://github.com/InternLM/InternBootcamp.git
cd InternBootcamp
pip install -e .
```
### 示例:Game24Bootcamp <a id="接口与用法"></a>

24 点是一类算术谜题:使用 `num_numbers` 个数字(每个不超过 `range_max`)和四则运算,得到目标值 `target`(其大小不超过 `target_max`)。

#### 生成问题

首先实例化训练场:

```python
from internbootcamp import Game24Bootcamp

# 指定难度参数
bootcamp = Game24Bootcamp(num_numbers=4, range_max=100, target_max=100, seed=42)

# 或者使用默认配置
# bootcamp_default = Game24Bootcamp()
```

然后利用 `case_generator` 接口合成问题实例,并通过 `prompt_func` 将实例转换为自然语言问题:

```python
identity = bootcamp.case_generator()
prompt = bootcamp.prompt_func(identity)

# 示例输出:
# - identity: {'puzzle': '8 43 65 77', 'target': 28}
# - prompt: "请解决这个谜题:使用 8, 43, 65, 77 通过基本算术运算得到 28..."
```
#### 验证结果

获得问题的回复 response 后,可以使用 `verify_score` 接口为结果打分。默认情况下,正确结果得 1 分,错误结果得 0 分;还可以额外指定 `format_score`,作为回复仅格式正确时返回的分数(默认为 0)。

```python
response = "...一些推理过程...\\boxed{77 / (65 - 43) * 8}"

score = Game24Bootcamp.verify_score(response, identity, format_score=0.1)
```

### 扩展到更多任务

您可以通过继承 `BaseBootcamp` 类来轻松添加新任务。详见 [examples/README.md](examples/README.md)。
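作为示意,下面给出一个实现上述三个接口的最小自定义任务。其中 `LessThanBootcamp` 这一任务及其内部逻辑均为本文虚构的草图;实际扩展时应继承 `internbootcamp.BaseBootcamp`,并参照 examples 中的真实实现:

```python
import random
import re


class LessThanBootcamp:  # 实际使用时应继承 internbootcamp.BaseBootcamp
    """虚构示例任务:生成“比较两个整数大小”的问题并验证答案。"""

    def __init__(self, range_max=100, seed=None):
        self.range_max = range_max
        self.rng = random.Random(seed)

    def case_generator(self):
        # 随机抽取两个不同的整数作为问题实例
        a, b = self.rng.sample(range(self.range_max), 2)
        return {"a": a, "b": b}

    def prompt_func(self, identity):
        # 把问题实例转换为自然语言问题
        return (f"{identity['a']} 和 {identity['b']} 中哪个更大?"
                "请将最终答案写在 \\boxed{} 中。")

    @classmethod
    def verify_score(cls, response, identity, format_score=0):
        # 从回复中抽取 \boxed{...} 内的答案并与标准答案比对
        m = re.search(r"\\boxed\{(\d+)\}", response)
        if m is None:
            return 0  # 连格式都不对:0 分
        answer = int(m.group(1))
        expected = max(identity["a"], identity["b"])
        return 1 if answer == expected else format_score
```

三个接口齐备后,该类即可像 `Game24Bootcamp` 一样用于数据生成与结果验证。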
### 强化学习

InternBootcamp 可以轻松与 RL 框架集成。详见 [examples/README.md](examples/README.md)。
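集成的核心思路是把 bootcamp 的 `verify_score` 包装成奖励函数。下面是一个与具体 RL 框架无关的最小示意,其中 `EchoBootcamp` 与 `make_reward_fn` 均为本文虚构,仅用于演示接口如何对接;实际训练时应替换为 internbootcamp 中的真实 bootcamp:

```python
class EchoBootcamp:
    """虚构任务:要求模型原样复述 identity 中的目标字符串。"""

    def case_generator(self):
        return {"target": "hello"}

    def prompt_func(self, identity):
        return f"请原样输出:{identity['target']}"

    def verify_score(self, response, identity, format_score=0):
        return 1 if identity["target"] in response else format_score


def make_reward_fn(bootcamp, identity):
    """把 verify_score 包装成可挂接到 RL 框架(如 VeRL)的奖励函数。"""
    def reward_fn(response: str) -> float:
        return float(bootcamp.verify_score(response, identity))
    return reward_fn


bootcamp = EchoBootcamp()
identity = bootcamp.case_generator()
reward = make_reward_fn(bootcamp, identity)  # 对每个 rollout 回复打分
```

由于所有 bootcamp 共享同一套接口,同一个包装函数即可服务于任意任务的混合训练。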
## 🧪 实验:通过可验证的任务缩放提升 LLM 推理

我们进行了大量实验,以研究**任务缩放**(使用更多、更多样的推理任务进行训练)如何增强大语言模型的推理能力。结果表明,任务缩放不仅提高了最终性能,还显著提升了训练效率。

<p align="center">
<img src="./figs/overall.png" width="800"/>
</p>

通过系统地缩放训练任务,我们观察到模型在不同推理领域的性能持续提升。使用更多任务训练的模型在我们的 Bootcamp-Eval 基准上取得更好的泛化能力和更高的准确率,展示了任务缩放在开发通用推理模型方面的有效性。同时,扩增训练任务的数量也能在强化学习过程中有效提升训练效率。
此外,我们发现多任务训练能够带来**涌现时刻(Emergent Moment)**:当与其他任务一起训练时,那些单独训练时无法解决的任务突然变得可学习。这种现象表明,跨任务知识迁移培养了潜在的泛化能力,使模型能够应对原本无法解决的复杂挑战。

<p align="center">
<img src="./figs/emergent.png" width="800"/>
</p>

📌 详细的实验结果和全面分析,欢迎参阅我们的[技术报告](https://arxiv.org/pdf/2508.08636) 📝。
## ⚫ 演示:[InternThinker-围棋](https://intern.openxlab.org.cn/internthinker/go-game)

LLM 在广泛的常见推理任务上已经展现出卓越的性能。然而,作为最早引发 AI 热潮的研究问题之一,通用 LLM 在**围棋**这一特定领域的推理能力却很少受到研究关注。虽然 AlphaZero 从“无需人类知识掌握围棋”的角度挑战了人类智能,但我们探索如何将人类智能带回这个古老的游戏,让人类独有的自然语言思维模式在 LLM 的新背景下再次闪耀。基于 InternBootcamp,我们实现了一个用于推理模型强化学习的围棋 bootcamp,使用专业围棋领域数据冷启动,并通过强化学习强化了模型的推理范式。我们的模型达到了与职业棋手相当的水平:InternThinker-GO 可以稳定击败业余 6 段水平的星阵(Golaxy)AI,并接近职业 1 星水平,是首个达到这一水平的通用大语言模型。

<p align="center">
<img src="figs/demo_internthinkerGO.png" width="1000"/>
</p>

对于给定的局面,InternThinker-GO 首先分析棋盘形势:*“右上角存在复杂的战斗区域,黑白双方都有多颗棋子。左下角有一些交错的黑白棋子,形成了一定的结构。黑棋沿着左边有阵型。黑棋刚刚在第 65 手下了 B14,这明显是为了侵入白棋左边的地盘。作为白棋,我需要小心应对这个威胁。”* 接下来,InternThinker-GO 具体预测并分析了 B13、C13 等潜在落点,并最终选择 B13 作为落子位置。
## 🙏 致谢

我们向以下工作表示感谢,它们为本项目提供了重要的启发和工具支持:

- [Intern-S1](https://github.com/InternLM/Intern-S1)
- [VeRL](https://github.com/volcengine/verl)
- [Xtuner](https://github.com/InternLM/xtuner)
- [OpenCompass](https://github.com/open-compass/opencompass)
## 📜 引用

如果您觉得我们的工作有帮助,请引用:

```bibtex
@misc{li2025internbootcamptechnicalreportboosting,
      title={InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling},
      author={Peiji Li and Jiasheng Ye and Yongkang Chen and Yichuan Ma and Zijie Yu and Kedi Chen and Ganqu Cui and Haozhan Li and Jiacheng Chen and Chengqi Lyu and Wenwei Zhang and Linyang Li and Qipeng Guo and Dahua Lin and Bowen Zhou and Kai Chen},
      year={2025},
      eprint={2508.08636},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.08636},
}
```
BIN figs/bootcamp-framework.png (new file, binary not shown; 459 KiB)
BIN figs/bootcamps.png (new file, binary not shown; 5.6 MiB)
BIN figs/emergent.png (new file, binary not shown; 222 KiB)
BIN figs/overall.png (new file, binary not shown; 228 KiB)
BIN figs/workflow.png (new file, binary not shown; 304 KiB)