Pipeline Usage

Configuration files

puzzle_configs: configuration files for the parameters passed to a bootcamp's __init__. Different parameters lead to different distributions of the generated samples.
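To illustrate, here is a minimal sketch of how a puzzle config maps onto a bootcamp's __init__ parameters. The class and parameter names (Futoshikibootcamp, grid_size, num_inequalities) are illustrative assumptions, not the actual signatures in the internbootcamp package.

```python
import json

# Hypothetical bootcamp class; real bootcamps live in the internbootcamp package
# and may take different constructor parameters.
class Futoshikibootcamp:
    def __init__(self, grid_size=4, num_inequalities=3):
        self.grid_size = grid_size
        self.num_inequalities = num_inequalities

# A puzzle config is a dict of __init__ keyword arguments; changing the values
# shifts the distribution (e.g. difficulty) of the generated samples.
config = json.loads('{"grid_size": 5, "num_inequalities": 6}')
bootcamp = Futoshikibootcamp(**config)
print(bootcamp.grid_size)
```

Two config files for the same bootcamp with different parameter values would yield two different sample distributions.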

data_configs: configuration files to run the final generation pipeline.

  • You can manually add the tasks you want to generate data for in the file.
  • You can run examples/pipelines/data_config_gen.py on examples/pipelines/puzzle_configs/ to automatically generate data_config_train.jsonl and data_config_test.jsonl under data_configs.

For example, an entry that includes futoshiki looks like this:

{"bootcamp_name": "futoshiki", "sample_number": 100, "config_file": "futoshiki", "bootcamp_cls_name": "Futoshikibootcamp"}

Here, sample_number is the number of data samples to generate, config_file is the name of the task configuration file, and bootcamp_cls_name is the class name of the bootcamp used to generate the data.
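A data_configs file is in JSON Lines format, one entry per line. The sketch below parses the futoshiki entry shown above and inspects its fields; how the pipeline then resolves bootcamp_cls_name to a class is internal to the generation scripts.

```python
import json

# One line of data_config_train.jsonl, exactly as in the example above.
line = ('{"bootcamp_name": "futoshiki", "sample_number": 100, '
        '"config_file": "futoshiki", "bootcamp_cls_name": "Futoshikibootcamp"}')

entry = json.loads(line)
# The pipeline would look up the class named by bootcamp_cls_name,
# configure it from config_file, and generate sample_number samples;
# here we only read the fields back.
print(entry["bootcamp_cls_name"], entry["sample_number"])
```

To add another task, append one more JSON object on its own line with the same four keys.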

Running the Data Generation Pipeline

run_pipeline.sh contains the unified pipeline to generate data for all tasks based on the configurations.

Quick Start

  1. Run the following command to gather all the bootcamps into a configuration file that specifies options for data generation:
python examples/pipelines/quickgen_data_configs.py

You can adjust train_sample_number and test_sample_number to control the number of samples to generate for the two sets.
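Conceptually, the two knobs produce one train entry and one test entry per bootcamp, differing only in sample_number. The helper below is a hypothetical sketch of that split, not the actual code of quickgen_data_configs.py.

```python
import json

# The two knobs mentioned above; adjust to change the size of each set.
train_sample_number = 100
test_sample_number = 10

def make_entries(bootcamp_name, cls_name):
    """Build one train and one test config line for a single bootcamp."""
    base = {"bootcamp_name": bootcamp_name,
            "config_file": bootcamp_name,
            "bootcamp_cls_name": cls_name}
    train = dict(base, sample_number=train_sample_number)
    test = dict(base, sample_number=test_sample_number)
    return json.dumps(train), json.dumps(test)

train_line, test_line = make_entries("futoshiki", "Futoshikibootcamp")
```

Writing all train lines to data_config_train.jsonl and all test lines to data_config_test.jsonl yields the two files the pipeline consumes.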

  2. Run bash examples/pipelines/run_pipeline.sh to generate data; the output is written under examples/bootcamp_generator_outputs.