Mirror of https://github.com/InternLM/InternBootcamp.git (synced 2026-04-19 12:58:04 +00:00)

Update Readme, release v1.0 (#12)

Commit f231ec6fa0 (parent a8249acc18), 7 changed files with 288 additions and 235 deletions
# InternBootcamp
<p align="center">
<a href="https://arxiv.org/pdf/2508.08636">📄 Paper</a> •
<a href="https://github.com/InternLM/InternBootcamp">⭐ Github</a> •
<a href="examples/data/Intenbootcamp_eval">📊 Evaluation</a> •
<a href="https://intern.openxlab.org.cn/internthinker/go-game">⚪ InternThinker-GO</a>
</p>

<p align="center">
🌍 <a href="./README.md">English</a> | <a href="./README_zh.md">简体中文</a>
</p>

<p align="center">
<img src="./figs/logo.png" width=400/>
</p>

InternBootcamp is an open-source framework comprising **1000+ domain-diverse task environments** specifically designed for LLM reasoning research. By integrating automated generation of unlimited training/testing cases with configurable difficulty levels and integrated verification modules, InternBootcamp serves as fundamental infrastructure for **RL-based model optimization**, **synthetic data generation**, and **model evaluation**.

Our key innovation lies in demonstrating that scaling the number of verifiable reasoning tasks during training significantly enhances both reasoning performance and training efficiency, a phenomenon we term **"Task Scaling"** 📈. Currently, InternBootcamp includes verifiable reasoning tasks across 8 diverse domains, covering problems related to algorithms, cryptography, natural science, language analysis, mathematical modeling, graphical puzzles, logical reasoning, and character puzzles. We are continuing efforts to expand its scope with the community.

## 🚀 Getting Started

Quickly get started with data generation, reinforcement learning training, model evaluation, and custom Bootcamp creation!

- [Installation](#installation)
- [Quick Start](examples/get_started.md)
- [Interfaces and Usages](#usage)
- [Bootcamp-Eval Dataset](examples/data/Intenbootcamp_eval)

## 📢 Update

- 📦 **[2025/08] v1.0 released!**
- 📄 **[2025/08] [InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling](https://arxiv.org/pdf/2508.08636) released.**
- 🌱 **[2025/04] v0.1 released.**

## 🧩 About

Large-scale reinforcement learning has been demonstrated to be an effective way towards expert-level reasoning models. Most current efforts to advance this technical routine concentrate on limited tasks, such as math, and focus on devising improved training algorithms. As a complement, we believe the investigation into **Task Scaling**, exposing models to a wide and growing spectrum of reasoning tasks, is essential for building general and robust reasoning models:

- Including more types of tasks covers diverse reasoning patterns, leading to more general intelligence;
- Studying tasks with controllable difficulties and their combinations facilitates understanding of training dynamics and enables more efficient training strategies.

Despite the abundance of potentially valuable tasks, their dispersed distribution across various sources makes it exceptionally difficult for practitioners to utilize them. To this end, we introduce InternBootcamp to facilitate related investigations and provide engineering convenience. In particular, we would like to highlight the following features of InternBootcamp:

- **🔧 Standardized:** InternBootcamp provides a unified interface for various tasks and is easy to integrate with different codebases for reinforcement learning or synthetic data. Each bootcamp class implements standardized methods to generate questions and verify solutions, allowing seamless integration with reinforcement learning or synthetic data pipelines.
- **📊 Scalable:** Thanks to an automatic agent workflow for bootcamp synthesis, InternBootcamp has grown to include a large volume of diverse bootcamp tasks. In the first release, it covers over 1000 complex reasoning tasks across 8 domains, including games, logic problems, puzzles, algorithms, scientific reasoning, and more. Over 90% of these bootcamps were developed through an automated synthesis and quality filtration pipeline, enabling continuous scaling of bootcamp environments with minimal human intervention.
- **🧱 Extensible:** InternBootcamp can be extended to support more diverse and complicated tasks (e.g., tasks with multi-turn interaction like Go and agent-based environments) and provide them with question generation and result verification. Representatively, we include `InternGObootcamp` as a demonstration.

We also conduct a series of investigations into reinforcement learning using InternBootcamp. Our preliminary findings are as follows:

- **Scalable task synthesis enables broad experiential learning**: Our automated agent workflow demonstrates that large-scale, diverse reasoning environments can be effectively synthesized via iterative, evolutionary methods, opening the door to training agents on a continuous stream of novel tasks.
- **Generalization emerges from cross-task exposure**: LLMs develop stronger reasoning generalization and emergent abilities not through deep specialization in narrow domains, but by learning across a wide spectrum of reasoning tasks.
- **Task scaling improves both performance and efficiency**: Increasing the number of training tasks significantly boosts both final performance and learning efficiency, with a near-linear relationship between task quantity and reasoning capability.
- **InternThinker-GO**: As a representative of single-task training, we train `InternThinker-GO` with the InternGObootcamp. `InternThinker-GO` approaches professional players using far fewer games than AlphaGO, surpassing current reasoning models. Besides excellent performance, `InternThinker-GO` provides reasonable and inspiring thoughts, demonstrating the great potential of human-like reasoning empowered by reinforcement learning in tackling expert-level tasks.

## 🎯 Supported Bootcamps

<p align="center">
<img src="./figs/bootcamps.png" width="1000"/>
</p>

In the first release, InternBootcamp has covered bootcamps for **1000+ tasks**, coming from:

- **🧠 Benchmarks for reasoning:** Currently, we have included tasks from [ARC-AGI](https://github.com/fchollet/ARC-AGI) (via [re-arc](https://github.com/michaelhodel/re-arc)), [KOR-Bench](https://kor-bench.github.io/), and [BBEH](https://github.com/google-deepmind/bbeh), three representative reasoning benchmarks, to build bootcamps. KOR-Bench includes five types of reasoning tasks, namely logic, operation, cipher, puzzle, and counterfactual reasoning; we neglect counterfactual reasoning for its dependence on specific world-view knowledge and build bootcamps for the remaining four types. BBEH comprises 23 reasoning tasks obtained by complicating tasks from BBH, and we build bootcamps for the tasks that do not depend on external knowledge.
- **🧩 Puzzle websites:** [puzzle-xxx](https://www.puzzle-aquarium.com/) is a series of puzzle webpages; we scrape 39 puzzles from it to prepare corresponding bootcamps.
- **⚙️ Algorithm problems:** Algorithm problems cover reasoning patterns in various algorithms, and they contain questions that are close to real-world applications. Meanwhile, algorithm problems in the wild usually come with reference solutions, making it easy to convert them into bootcamps. Currently, we use [CodeContests](https://huggingface.co/datasets/deepmind/code_contests), select 1265 tasks of medium difficulty (Codeforces points between 1000 and 2000), and apply our automatic workflow to construct corresponding bootcamps. Additionally, we adapted tasks from [CodeIO](https://codei-o.github.io/), which translates code-based reasoning into natural language to assess large language models' reasoning capabilities.
- **💻 Benchmarks for programming capability:** Currently, we have included tasks from [BigCodeBench](https://bigcode-bench.github.io/) and [KodCode](https://kodcode-ai.github.io/), two representative programming benchmarks, to build bootcamps. These benchmarks feature diverse and challenging problems that require language models to generate correct code. For each task, we collected or adapted a `unittest` script to validate solution correctness.
- **📋 Instruction following:** These tasks test a model's ability to comprehend and strictly adhere to instructions embedded in task descriptions. In many cases, correctness can be evaluated through code execution feedback. We included tasks from [AutoIF](https://github.com/QwenLM/AutoIF), which contains over 60,000 instruction–evaluation function pairs, each treated as an individual task.
- **🎮 Games:** Games are complex reasoning tasks involving multi-turn interactions with controllable and verifiable objectives. As a representative, we built `InternGObootcamp` to train a reasoning model for Go.
- **🔬 Scientific tasks:** Scientific tasks represent a spectrum of reasoning-intensive endeavors deeply intertwined with scientific research, regarded as among the most valuable domains for AI to revolutionize. We consider that improving reasoning models on these tasks facilitates the achievement of this vision. Part of the scientific task collection is supported by the [Intern-S1](https://arxiv.org/pdf/2508.15763) team, and in return, InternBootcamp also provides training support for Intern-S1.

We are continuing our efforts and call for the community to verify the automatically generated bootcamps. We present the full list of bootcamps in [the full bootcamp list](./Fulllist_InternBootcamp.md) and illustrate our automatic workflow below.

## 🤖 Automatic Agent Workflow for Large-Scale Bootcamp Synthesis

<p align="center">
<img src="./figs/workflow.png" width="1000"/>
</p>

Manually coding bootcamps for each task is inefficient and not scalable. We introduce an **automatic agent workflow** that leverages large language models to generate bootcamp code from task descriptions. This pipeline involves:

1. **📥 Task Description Collection:** Identify verifiable tasks (puzzles, reasoning benchmarks, algorithm problems, etc.) and collect their descriptions and supporting information.
2. **🔄 Evolutionary Code Generation:** Use strong coding models (e.g., Deepseek-R1) to generate bootcamp code iteratively, incorporating execution feedback to avoid oversimplification and errors.
3. **✅ Self-Consistent Unittest Filtering:** Filter bootcamps by evaluating LLM responses with `verify_score` as unittests. Bootcamps with accuracy outside [0.03, 0.85] are filtered out.

This workflow has enabled rapid expansion to 1000+ bootcamps with high quality and diversity.
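The accuracy-band rule in step 3 can be sketched as follows. The helper and the example accuracies are illustrative, not the repository's actual pipeline code; only the [0.03, 0.85] band comes from the workflow description above:

```python
# Illustrative sketch of self-consistent unittest filtering.
# A bootcamp survives only if a reference LLM's measured accuracy on its
# generated cases falls inside the accepted band: near-0 accuracy suggests a
# broken or impossibly hard environment, near-1 an overly simple one.

MIN_ACC, MAX_ACC = 0.03, 0.85

def keep_bootcamp(accuracy: float) -> bool:
    """Return True if the measured accuracy is inside the accepted band."""
    return MIN_ACC <= accuracy <= MAX_ACC

# Hypothetical accuracies measured for three candidate bootcamps
measured = {"taskA": 0.0, "taskB": 0.42, "taskC": 0.97}
survivors = [name for name, acc in measured.items() if keep_bootcamp(acc)]
# survivors == ["taskB"]
```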

## 🛠 Interfaces and Usages

Each bootcamp inherits from `BaseBootcamp` and implements three main interfaces: `case_generator`, `prompt_func`, and `verify_score`, serving question generation and result verification.

<p align="center">
<img src="./figs/bootcamp-framework.png" width="1000"/>
</p>

### Installation <a id="installation"></a>

```bash
git clone https://github.com/InternLM/InternBootcamp.git
cd InternBootcamp
pip install -e .
```

### Example: Game24Bootcamp <a id="usage"></a>

We introduce the usage of the bootcamp with `Game24Bootcamp` as an example. Other supported bootcamps can be found in [the full bootcamp list](./Fulllist_InternBootcamp.md).

Game24 is an arithmetic puzzle where you use `num_numbers` numbers (each ≤ `range_max`) and basic operations to obtain the `target` value (≤ `target_max`).

#### Generating Questions

First, instantiate the bootcamp; then use the `case_generator` method to generate a problem instance and `prompt_func` to convert the instance into a natural-language question:

```python
from internbootcamp import Game24Bootcamp

# Specify difficulty parameters
bootcamp = Game24Bootcamp(num_numbers=4, range_max=100, target_max=100, seed=42)

# Or use the default configuration
# bootcamp_default = Game24Bootcamp()

identity = bootcamp.case_generator()
prompt = bootcamp.prompt_func(identity)

# Example output:
# - identity: {'puzzle': '8 43 65 77', 'target': 28}
# - prompt: "Please solve the puzzle: using 8 43 65 77 to obtain 28 through basic arithmetics..."
```

#### Verifying Results

After obtaining the response to the question, use the `verify_score` method to score the result. By default, a score of 1 is returned for a correct result and 0 for an incorrect one. Additionally, you can specify `format_score` (defaulting to 0), the score returned when only the response format is correct (i.e., the answer itself is wrong).

```python
response = "...some reasoning process...\\boxed{77 / (65 - 43) * 8}"
score = Game24Bootcamp.verify_score(response, identity, format_score=0.1)
```
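Conceptually, verification for a Game24-style task extracts the boxed expression, checks that it uses exactly the given numbers, and evaluates it against the target. A self-contained sketch of that idea (our own illustration, not the library's actual `verify_score` implementation):

```python
import re
from fractions import Fraction

def check_game24(response: str, puzzle: str, target: int) -> bool:
    """Sketch of Game24-style checking: extract \\boxed{...}, confirm the
    numbers used, and evaluate the expression with exact arithmetic."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if not m:
        return False
    expr = m.group(1)
    # The expression must use exactly the given numbers (as a multiset).
    if sorted(re.findall(r"\d+", expr)) != sorted(puzzle.split()):
        return False
    try:
        # Wrap every integer in Fraction to avoid floating-point issues.
        value = eval(re.sub(r"(\d+)", r"Fraction(\1)", expr),
                     {"__builtins__": {}}, {"Fraction": Fraction})
    except Exception:  # division by zero, malformed expression, etc.
        return False
    return value == target

print(check_game24("...\\boxed{77 / (65 - 43) * 8}", "8 43 65 77", 28))  # True
```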

### Extending to More Tasks

New bootcamps can be created by inheriting from `BaseBootcamp` and implementing its core interfaces. See [examples/README.md](examples/README.md) for details.
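To illustrate the interface shape, here is a toy, hypothetical task written as a standalone class; a real extension would inherit from `BaseBootcamp` and follow [examples/README.md](examples/README.md):

```python
import random
import re

class SumBootcamp:
    """Toy environment mirroring the bootcamp interface shape (a made-up
    task for illustration, not part of the library)."""

    def __init__(self, n=3, max_value=50, seed=None):
        self.n, self.max_value = n, max_value
        self.rng = random.Random(seed)

    def case_generator(self):
        # Produce one problem instance ("identity") with its ground truth.
        nums = [self.rng.randint(1, self.max_value) for _ in range(self.n)]
        return {"numbers": nums, "answer": sum(nums)}

    def prompt_func(self, identity):
        nums = " ".join(map(str, identity["numbers"]))
        return f"Compute the sum of {nums}. Put the final answer in \\boxed{{}}."

    @staticmethod
    def verify_score(response, identity, format_score=0):
        # Full credit for the right answer, format_score for a well-formed
        # but wrong one, zero otherwise.
        m = re.search(r"\\boxed\{(-?\d+)\}", response)
        if not m:
            return 0
        return 1 if int(m.group(1)) == identity["answer"] else format_score

bootcamp = SumBootcamp(n=3, max_value=10, seed=0)
identity = bootcamp.case_generator()
assert SumBootcamp.verify_score(f"\\boxed{{{identity['answer']}}}", identity) == 1
```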

### Reinforcement Learning

InternBootcamp integrates easily with mainstream reinforcement learning and synthetic data frameworks. See [examples/README.md](examples/README.md) for details.
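Since `verify_score` returns a scalar, it maps naturally onto an RL reward. A minimal sketch of wrapping it into a reward function; the `StubBootcamp` below stands in for a real bootcamp class such as `Game24Bootcamp`, and the wrapper signature is our own, not any specific framework's API:

```python
def make_reward_fn(bootcamp_cls, format_score=0.1):
    """Wrap a bootcamp's verify_score into an RL reward function."""
    def reward_fn(completion: str, identity: dict) -> float:
        return float(bootcamp_cls.verify_score(completion, identity,
                                               format_score=format_score))
    return reward_fn

# Demo with a stub verifier standing in for a real bootcamp class;
# real usage would pass the imported class (e.g. Game24Bootcamp) instead.
class StubBootcamp:
    @staticmethod
    def verify_score(response, identity, format_score=0):
        if "\\boxed{" not in response:
            return 0
        return 1 if str(identity["target"]) in response else format_score

reward = make_reward_fn(StubBootcamp)
print(reward("\\boxed{28}", {"target": 28}))  # 1.0
print(reward("\\boxed{27}", {"target": 28}))  # 0.1
```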

## 🧪 Experiments: Boosting LLM Reasoning with Verifiable Task Scaling

We conduct extensive experiments to investigate how **Task Scaling**, training with an increasing number and diversity of reasoning tasks, enhances the reasoning capabilities of large language models. Our findings demonstrate that Task Scaling not only improves final performance but also significantly boosts training efficiency.

<p align="center">
<img src="./figs/overall.png" width="800"/>
</p>

Through systematic scaling of training tasks, we observe consistent improvements in model performance across diverse reasoning domains. Models trained with more tasks achieve better generalization and higher accuracy on our Bootcamp-Eval benchmark, showcasing the effectiveness of Task Scaling in developing versatile reasoning models. Besides, scaling the number of training tasks also enhances training efficiency in the RLVR process.

### DEMO: InternThinker-GO

LLMs have demonstrated remarkable performance across a wide range of common reasoning tasks. However, as one of the earliest research problems that ignited the AI boom, the reasoning capabilities of general-purpose LLMs in the specific domain of **Go** have received little research attention. While AlphaZero challenged human intelligence in the Go domain from the perspective of "Mastering the Game of Go without Human Knowledge," we explore how to bring human intelligence back to this ancient game, allowing the natural language thinking patterns unique to humans to shine again in the new context of LLMs. Based on InternBootcamp, we implemented a Go bootcamp for reinforcement learning of reasoning models, cold-started with professional Go domain data, and reinforced the model's reasoning paradigm through reinforcement learning. Our model achieves performance comparable to professional Go players: InternThinker-GO can consistently defeat Golaxy AI at the amateur 6-dan level and approaches the professional 1-star level, making it the first general large language model to reach this level of performance.

<p align="center">
<img src="figs/demo_internthinkerGO.png" width=1200/>
</p>

For a given state, InternThinker-GO first analyzes the situation on the board: "There is a complex battle area in the upper right corner, where both Black and White have multiple stones. The lower left corner has some interlaced Black and White stones, forming a certain structure. Black has a formation along the left edge. Black just played move 65 at B14, which is clearly intended to invade White's territory on the left side. As White, I need to respond carefully to this threat." Next, InternThinker-GO specifically predicts and analyzes potential moves such as B13, C13, and ultimately selects B13 as the placement position.

### Training with Task Mixing

We train and evaluate our model on a mix of 80 validated bootcamps based on `Deepseek-R1-Distill-Qwen-32B`, using GRPO. Our results show:

- Frontier reasoning models still have large room for improvement on bootcamp tasks, and reinforcement learning on these tasks is an effective way to achieve it.
- Interestingly, we find models trained on bootcamp tasks also improve on general reasoning benchmarks with unseen tasks, such as math and professional knowledge QA, unveiling that bootcamp tasks can be effective for improving general reasoning capabilities.
- Meanwhile, we find training with task mixing fosters the "**emerging moment**", where some tasks that fail to be learned alone suddenly improve in mixed training after a certain training step. This demonstrates the potential of scaling training tasks, serving as a kind of implicit curriculum learning, to elicit learning on challenging tasks.

Details are as follows.

**Performance on bootcamp Tasks** We randomly divided the difficulty configurations to generate data for the training and test sets on the validated bootcamps (`examples/data/InternBootcamp_eval`).
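Such a split by difficulty configuration can be sketched as follows; the parameter grid and 80/20 ratio are illustrative, not the configurations actually used in the experiments:

```python
import itertools
import random

# Hypothetical difficulty grid for one bootcamp (illustrative values).
grid = {"num_numbers": [3, 4, 5], "range_max": [20, 50, 100]}
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# Randomly split the configurations so that test difficulties are unseen
# during training; questions are then generated per split.
rng = random.Random(42)
rng.shuffle(configs)
split = int(0.8 * len(configs))
train_configs, test_configs = configs[:split], configs[split:]
assert all(c not in train_configs for c in test_configs)
```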
<p align="center">
<img src="figs/bootcamp_perf.png" width=500/>
</p>
* With this data, we first evaluate mainstream reasoning models. They still have considerable room for improvement on our bootcamp tasks, and some reasoning models show no clear advantage over their non-reasoning counterparts.
* Accordingly, we study whether reinforcement learning can effectively improve model performance on these bootcamp tasks. Specifically, we use the generated training data to train `DeepSeek-R1-Distill-Qwen-32B`. With 22k prompts, accuracy on the test set increases considerably, by 27%, surpassing `OpenAI o3-mini`, `DeepSeek-R1`, and `Claude-3.7-Sonnet` and demonstrating the effectiveness of reinforcement learning in enhancing model ability on these tasks.
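As a reference for how GRPO consumes the verifier signal, here is a minimal sketch of group-relative advantage estimation (illustrative only, not the training code behind these results): each prompt is rolled out several times, and each rollout's binary verifier reward is normalized against its own sampling group.

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and standard deviation of its own group."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Binary verifier rewards from 4 rollouts of the same bootcamp prompt.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct rollouts get positive advantage, incorrect ones negative.
assert advs[0] > 0 > advs[1]
```

Because bootcamp verifiers return programmatic rewards, this normalization needs no learned reward model, which is what makes the RLVR setup scale across a thousand-plus tasks.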
||Qwen-32B-Instruct|DeepSeek-R1-Distill-Qwen-32B|QwQ-32B|DeepSeek-R1|Claude-3.7-Sonnet (w/o thinking)|Claude-3.7-Sonnet (w/ thinking)|OpenAI o3-mini|Ours|
|-|-|-|-|-|-|-|-|-|
|Accuracy on bootcamp Testset (%)|30.0|30.5|23.0|45.3|39.2|54.5|41.3|**58.2**|
**Emergent Moment** Notably, benefiting from inter-task generalization, we find that training on a mixture of diverse tasks elicits improvement on challenging tasks that fail to learn when trained alone. As demonstrated in the figure, on the Tapa and KorOperationUnicode25ce tasks, the model hardly ever samples a correct solution and thus fails to learn when trained alone. Nonetheless, its performance takes off and then keeps improving after a few steps of mixed training. This demonstrates the potential of scaling training tasks in reinforcement learning to provide implicit curriculum learning for tackling challenging tasks.
<p align="center">
<img src="figs/heartbeat_tapa.png" width=350/> <img src="figs/heartbeat_kor25ce.png" width=350/>
<img src="./figs/emergent.png" width="800"/>
</p>
📌 For detailed experimental results and comprehensive analysis, please refer to our [technical report](https://arxiv.org/pdf/2508.08636) 📝.
## ⚫ DEMO: [InternThinker-GO](https://intern.openxlab.org.cn/internthinker/go-game)
LLMs have demonstrated remarkable performance across a wide range of common reasoning tasks. However, the reasoning capabilities of general-purpose LLMs in **Go**, one of the earliest research problems that ignited the AI boom, have received little research attention. While AlphaZero challenged human intelligence in Go from the perspective of "Mastering the Game of Go without Human Knowledge", we explore how to bring human intelligence back to this ancient game, letting the natural language thinking patterns unique to humans shine again in the new context of LLMs. Based on InternBootcamp, we implemented a Go bootcamp, cold-started the reasoning model with professional Go domain data, and strengthened its reasoning paradigm through reinforcement learning. Our model achieves performance comparable to professional Go players: InternThinker-GO consistently defeats Golaxy AI at the amateur 6-dan level and approaches the professional 1-star level, making it the first general large language model to reach this level of performance.
<p align="center">
<img src="figs/demo_internthinkerGO.png" width="1000"/>
</p>
For a given state, InternThinker-GO first analyzes the situation on the board: *"There is a complex battle area in the upper right corner, where both Black and White have multiple stones. The lower left corner has some interlaced Black and White stones, forming a certain structure. Black has a formation along the left edge. Black just played move 65 at B14, which is clearly intended to invade White's territory on the left side. As White, I need to respond carefully to this threat."* Next, InternThinker-GO specifically predicts and analyzes potential moves such as B13, C13, and ultimately selects B13 as the placement position.
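The moves above use standard Go board coordinates: column letters A–T (skipping "I" by convention) and row numbers 1–19. A minimal, illustrative helper for parsing and validating such a move string before checking it against a board:

```python
GO_COLUMNS = "ABCDEFGHJKLMNOPQRST"  # 19 columns; "I" is skipped by convention

def parse_go_move(move, board_size=19):
    """Parse a coordinate like 'B14' into 0-based (col, row) indices,
    raising ValueError for anything off the board."""
    col_letter, row_digits = move[0].upper(), move[1:]
    if col_letter not in GO_COLUMNS[:board_size]:
        raise ValueError(f"bad column in move: {move}")
    row = int(row_digits)
    if not 1 <= row <= board_size:
        raise ValueError(f"bad row in move: {move}")
    return GO_COLUMNS.index(col_letter), row - 1

assert parse_go_move("B14") == (1, 13)
assert parse_go_move("B13") == (1, 12)
```

A Go bootcamp verifier can build on a parser like this to score a predicted move against the legal moves of the current board state.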
## 🙏 Acknowledgements
We acknowledge the following excellent projects, whose work has provided significant inspiration and tooling for our efforts:
- [Intern-S1](https://github.com/InternLM/Intern-S1)
- [VeRL](https://github.com/volcengine/verl)
- [Xtuner](https://github.com/InternLM/xtuner)
- [OpenCompass](https://github.com/open-compass/opencompass)
## 📜 Citation
If you find our work helpful, please cite:
```bibtex
@misc{li2025internbootcamptechnicalreportboosting,
      title={InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling},
      author={Peiji Li and Jiasheng Ye and Yongkang Chen and Yichuan Ma and Zijie Yu and Kedi Chen and Ganqu Cui and Haozhan Li and Jiacheng Chen and Chengqi Lyu and Wenwei Zhang and Linyang Li and Qipeng Guo and Dahua Lin and Bowen Zhou and Kai Chen},
      year={2025},
      eprint={2508.08636},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.08636},
}
```