diff --git a/README.md b/README.md index f3511e4..abff164 100755 --- a/README.md +++ b/README.md @@ -12,6 +12,7 @@ Internbootcamp is an easy-to-use and extensible library of bootcamp environments ## Getting Started * [Installation](#section3) +* [Quick Start](examples/get_started.md) * [Interfaces and Usages](#section4) diff --git a/README_zh.md b/README_zh.md index 5a56545..82f3fd9 100755 --- a/README_zh.md +++ b/README_zh.md @@ -13,6 +13,7 @@ Internbootcamp是一个易用、可扩展的推理模型训练训练场环境。 ## Getting Started * [安装](#section3) +* [快速开始](examples/get_started_zh.md) * [接口及用法](#section4) diff --git a/examples/get_started.md b/examples/get_started.md new file mode 100644 index 0000000..6bb48df --- /dev/null +++ b/examples/get_started.md @@ -0,0 +1,74 @@ +# Quick Start + +**InternBootcamp** provides functionalities such as data generation, model training, and model evaluation. You can refer to the following guide to get started quickly. + +To ensure successful execution of the following steps, please make sure **InternBootcamp is installed**, and **the project root directory is set as your working directory**. + +## Data Generation + +Run [**run\_pipeline.sh**](pipelines/run_pipeline.sh) to generate training and testing data based on the [default configuration](pipelines/data_configs). +For custom configurations, please refer to the [Pipeline Usage Guide](pipelines/README.md) for personalized setup. + +```bash +source examples/pipelines/run_pipeline.sh +``` + +The generated data will be saved in the [bootcamp\_generator\_outputs directory](bootcamp_generator_outputs). Each data batch is timestamped, and the directory structure is as follows: + +``` +examples/ +├── ... +└── bootcamp_generator_outputs/ + ├── ... + └── 2025-xx-xx-xx:xx:xx/ + ├── test/ + │ ├── bootcamp_0.jsonl + │ ├── ... + │ └── bootcamp_n.jsonl + └── train/ + ├── bootcamp_0.jsonl + ├── ...
+ └── bootcamp_n.jsonl +``` + +## Model Training (Reinforcement Learning) + +We provide support for two training frameworks: **Xpuyu** and **Verl**. + +### Xpuyu + +Refer to the [Xpuyu documentation](xpuyu_usage/README.md) to get started with efficient training. + +### Verl + +To integrate Bootcamp tasks into the Verl framework for training, you need to embed the Bootcamp reward computation method into Verl. +See the [Verl documentation](verl_usage/README.md) for detailed guidance. + +## Model Evaluation + +For Bootcamp tasks, we offer a customized evaluation service. +Once the model to be evaluated is deployed using frameworks like FastChat or Ollama, and you have the corresponding API URL and API Key, you can run the following command to evaluate your deployed model on the **InternBootcamp\_eval** dataset: + +```bash +cd InternBootcamp +python examples/unittests/run_eval.py \ + --url http://127.0.0.1:8000/v1 \ + --api_key EMPTY \ + --model_name r1_32B \ + --api_mode completion \ + --template r1 \ + --max_tokens 32768 \ + --temperature 0 \ + --test_dir examples/data/InternBootcamp_eval \ + --max_concurrent_requests 128 \ + --timeout 6000 \ + --max_retries 16 \ + --max_retrying_delay 60 +``` + +**Note:** When `api_mode` is set to `completion`, be sure to correctly specify the corresponding template (supported: `r1`, `qwen`, `internthinker`, `chatml` (with no system prompt)). +For more detailed instructions, refer to the [Evaluation Manual](unittests/README.md).
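The train/test splits that `get_started.md` describes above are JSON Lines files (`bootcamp_*.jsonl`, one JSON object per line). A minimal loading sketch follows; the per-record schema is not specified in this diff, so the `"prompt"` field below is a hypothetical placeholder, not the actual field name:

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Load one generated bootcamp_*.jsonl split into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Self-contained demo on a synthetic file; "prompt" is an assumed field name.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "bootcamp_0.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"prompt": "example task"}\n{"prompt": "another task"}\n')
        print(len(load_jsonl(path)))  # prints 2
```

The same reader works for any of the generated `train/` or `test/` splits, since JSON Lines imposes no schema beyond one JSON value per line.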
diff --git a/examples/get_started_zh.md b/examples/get_started_zh.md new file mode 100644 index 0000000..b792a3f --- /dev/null +++ b/examples/get_started_zh.md @@ -0,0 +1,52 @@ +# 快速开始 +InternBootcamp提供了数据生成、模型训练、模型评测等功能,可参考如下指南快速入门。 +为确保后续操作成功执行,请确保您已经安装了InternBootcamp,并将项目根目录设为工作目录。 +## 数据生成 + +运行[**run_pipeline.sh**](pipelines/run_pipeline.sh)即可根据[默认配置](pipelines/data_configs)生成对应的测试与训练数据。若有自定义配置等需求可参考[Pipeline Usage](pipelines/README_zh.md)进行个性化配置。 +```bash +source examples/pipelines/run_pipeline.sh +``` +生成的数据保存于[bootcamp_generator_outputs目录](bootcamp_generator_outputs),数据批次以时间戳命名,具体目录结构如下: +``` +examples/ +├── ... +└── bootcamp_generator_outputs/ + ├── ... + └── 2025-xx-xx-xx:xx:xx/ + ├── test/ + │ ├── bootcamp_0.jsonl + │ ├── ... + │ └── bootcamp_n.jsonl + └── train/ + ├── bootcamp_0.jsonl + ├── ... + └── bootcamp_n.jsonl +``` + +## 模型训练 +我们提供了对两种训练框架(Xpuyu、Verl)的支持。 +### Xpuyu +可参考[Xpuyu说明文档](xpuyu_usage/README_zh.md)快速进行高效训练。 +### Verl +要在Verl框架下加入Bootcamp任务进行训练,需要参考说明文档将bootcamp奖励计算方法嵌入verl框架,具体参考[Verl说明文档](verl_usage/README_zh.md)进行。 + +## 模型评测 +我们针对bootcamp任务,提供了个性化的评测服务,在使用FastChat、Ollama等框架部署好待测模型并获取对应API URL 与 API Key后,使用如下命令可快速评测已部署的模型在**InternBootcamp_eval**评测集上的性能: +```bash +cd InternBootcamp +python examples/unittests/run_eval.py \ + --url http://127.0.0.1:8000/v1 \ + --api_key EMPTY \ + --model_name r1_32B \ + --api_mode completion \ + --template r1 \ + --max_tokens 32768 \ + --temperature 0 \ + --test_dir examples/data/InternBootcamp_eval \ + --max_concurrent_requests 128 \ + --timeout 6000 \ + --max_retries 16 \ + --max_retrying_delay 60 +``` +注意当api_mode指定为completion时,需正确设置对应的template(支持r1、qwen、internthinker、chatml(with no system prompt))。更详细的内容参考[评测手册](unittests/README_zh.md)。 \ No newline at end of file diff --git a/examples/pipelines/data_configs/data_config_train.jsonl b/examples/pipelines/data_configs/data_config_train.jsonl index ad231ed..00821a0 100644 ---
a/examples/pipelines/data_configs/data_config_train.jsonl +++ b/examples/pipelines/data_configs/data_config_train.jsonl @@ -1 +1 @@ -{"bootcamp_name": "medcalculator", "sample_number": 100000, "config_file": "med_calculator", "bootcamp_cls_name": "Medcalculatorbootcamp"} \ No newline at end of file +{"bootcamp_name": "medcalculator", "sample_number": 100000, "config_file": "med_calculator", "bootcamp_cls_name": "Medcalculatorbootcamp"} diff --git a/examples/pipelines/puzzle_configs/BBEHHyperbaton_test.json b/examples/pipelines/puzzle_configs/BBEHHyperbaton_test.json new file mode 100644 index 0000000..1610ea1 --- /dev/null +++ b/examples/pipelines/puzzle_configs/BBEHHyperbaton_test.json @@ -0,0 +1,3 @@ +[ + +] \ No newline at end of file diff --git a/examples/pipelines/puzzle_configs/BBEHHyperbaton_train.json b/examples/pipelines/puzzle_configs/BBEHHyperbaton_train.json new file mode 100644 index 0000000..1610ea1 --- /dev/null +++ b/examples/pipelines/puzzle_configs/BBEHHyperbaton_train.json @@ -0,0 +1,3 @@ +[ + +] \ No newline at end of file diff --git a/examples/xpuyu_usage/README_zh.md b/examples/xpuyu_usage/README_zh.md new file mode 100644 index 0000000..66069c1 --- /dev/null +++ b/examples/xpuyu_usage/README_zh.md @@ -0,0 +1,80 @@ +# 使用 Xtuner 进行 Bootcamp 训练 + +## 🚄 训练教程 + +### 1. 安装依赖项 + +我们使用 [XTuner](https://github.com/InternLM/xtuner/tree/main) 作为训练引擎。 + +您需要确保 InternBootcamp 已成功安装。 + +```bash +pip install -e $InternBootcamp_path +``` + +然后安装 xtuner 及其依赖项。 + +```bash +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124 +pip install flash-attn --no-build-isolation +pip install xtuner[all]==0.2.0rc0 +``` + +### 2. 准备数据 + +可以通过 `examples/xpuyu_usage/xpuyu_data_preprocess.py` 将 bootcamp 数据转换为训练格式。 + +**示例用法:** +```bash +python examples/xpuyu_usage/xpuyu_data_preprocess.py --src examples/bootcamp_generator_outputs/{%Y-%m-%d-%H:%M:%S} +``` + +### 3.
准备您的训练配置 + +准备您的训练配置以开始 GRPO 训练。 + +一个示例配置文件位于: + +```bash +examples/xpuyu_usage/bootcamp_rl/configs/example_training_config.py +``` + +### 4. 开始训练 + +```bash +cd examples/xpuyu_usage + +GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())') + +# GPU 工作节点的数量,如果是单工作节点训练,请设置为 1 +NNODES=${WORLD_SIZE:-1} # 修改以适应集群环境 + +# 当前工作节点的编号,应为 {0, ..., WORKER_CNT-1} 中的值,如果是单工作节点训练,请设置为 0 +NODE_RANK=${RANK:-0} # 修改以适应集群环境 + +# Rank-0 工作节点的 IP 地址,如果是单工作节点训练,请设置为 localhost +MASTER_ADDR=${MASTER_ADDR:-localhost} + +# 通信端口 +MASTER_PORT=${MASTER_PORT:-6001} + +DISTRIBUTED_ARGS=" + --nproc_per_node $GPUS_PER_NODE \ + --nnodes $NNODES \ + --node_rank $NODE_RANK \ + --master_addr $MASTER_ADDR \ + --master_port $MASTER_PORT +" + +echo $DISTRIBUTED_ARGS + +torchrun $DISTRIBUTED_ARGS train_grpo.py ./bootcamp_rl/configs/example_training_config.py --work_dir examples/xpuyu_usage/ckpts/experiment_name +``` + +### 5. 训练曲线可视化 + +您可以使用 `examples/xpuyu_usage/report_to_wandb.py` 来可视化训练曲线。 + +```bash +python examples/xpuyu_usage/report_to_wandb.py examples/xpuyu_usage/ckpts/{experiment_name}/{timestamp}/rank0.log.jsonl {wandb_project_name} +``` \ No newline at end of file diff --git a/internbootcamp/bootcamp/__init__.py b/internbootcamp/bootcamp/__init__.py index 7922456..4de2647 100755 --- a/internbootcamp/bootcamp/__init__.py +++ b/internbootcamp/bootcamp/__init__.py @@ -26,7 +26,7 @@ from .bbeh_multistep_arithmetic.multistep_arithmetic_default import BbehMultiste from .bbeh_multistep_arithmetic.multistep_arithmetic import BBEHMultistepArithmeticV2bootcamp from .bbeh_shuff_object.shuff_object import Bbehshuffobjectbootcamp from .bbeh_dyck.bbeh_dyck_languages_InternBootcampDEV import BbehDyckLanguagesbootcamp - +from .bbeh_hyperbaton.hyperbaton_default import BBEHHyperbatonbootcamp from .bbeh_boardgame_qa.bbeh_boardgame_qa import Bbehboardgameqabootcamp from .bbeh_boolean_expressions.bbeh_boolean_expressions import Bbehbooleanexpressionsbootcamp diff --git 
a/internbootcamp/bootcamp/bbeh_hyperbaton/hyperbaton_default.py b/internbootcamp/bootcamp/bbeh_hyperbaton/hyperbaton_default.py new file mode 100644 index 0000000..fae18f7 --- /dev/null +++ b/internbootcamp/bootcamp/bbeh_hyperbaton/hyperbaton_default.py @@ -0,0 +1,131 @@ +import random +import json +import re +import os +import sys +import traceback +from typing import Dict, Any, List, Union +from internbootcamp.bootcamp.base import Basebootcamp +from internbootcamp.libs.bbeh_hyperbaton.bbeh_hyperbaton_generator import HyperbatonGenerator +from internbootcamp.libs.bbeh_hyperbaton.bbeh_hyperbaton_solver import HyperbatonSolver +from internbootcamp.libs.bbeh_hyperbaton.bbeh_hyperbaton_validor import HyperbatonValidator + + +class BBEHHyperbatonbootcamp(Basebootcamp): + _generator = HyperbatonGenerator() + _solver = HyperbatonSolver() + _validator = HyperbatonValidator() + + def __init__(self): + + self.generator = self._generator + self.solver = self._solver + self.validator = self._validator + + @classmethod + def case_generator(cls) -> Dict[str, Any]: + if cls._generator is None: + raise RuntimeError("Generator not initialized. Create an instance of BBEHHyperbatonbootcamp first.") + return cls._generator.generate_task() + + @classmethod + def extract_output(cls,output: str) -> Union[str, None]: + try: + if not output or output.strip() == "" or "null" in output.lower(): + return None + + # 查找最终答案部分 + answer_match = re.search(r'最终答案:\s*([A-K]+)(?:\n|$)', output, re.MULTILINE) + if answer_match: + answer_text = answer_match.group(1).strip().strip('"\'') + # 验证答案格式是否正确(只包含A-K的字母) + if re.match(r'^[A-K]+$', answer_text): + return answer_text + return None + + except Exception as e: + return None + + @classmethod + def _verify_correction(cls, answer: Any, identity: Dict[str, Any]) -> bool: + if cls._solver is None or cls._validator is None: + raise RuntimeError( + "Solver or Validator not initialized. 
Create an instance of BBEHHyperbatonbootcamp first.") + + test_case = { + "input": identity["input"], + "target": identity["target"] + } + + # 使用验证器验证答案 + validation_result = cls._validator.validate_batch([test_case], [answer]) + return validation_result["detailed_results"][0]["is_correct"] + + @staticmethod + def prompt_func(identity: Dict[str, Any]) -> str: + question = identity['input'] + prompt = f"""你是一个擅长分析英语形容词顺序的AI助手。请解决以下形容词顺序问题: + +问题: +{question} + +请仔细分析示例中的形容词顺序规律,并判断哪些选项符合这个规律。 + +请按以下格式输出你的答案: +最终答案: [你的答案] + +注意:答案应该是由正确选项的字母组成的字符串(如"ABC"),如果没有正确选项则输出"K"。 +""" + return prompt + + +if __name__ == '__main__': + # try: + # # 抑制其他模块的输出 + # original_stdout = sys.stdout + # sys.stdout = open(os.devnull, 'w') + + # current_dir = os.path.dirname(os.path.abspath(__file__)) + # intern_bootcamp_dir = os.path.abspath(os.path.join(current_dir, '..')) + + # sys.path.append(intern_bootcamp_dir) + + # bootcamp = BBEHHyperbatonbootcamp() + # identity = BBEHHyperbatonbootcamp.case_generator() + + # if identity is None: + # sys.stdout = original_stdout + # print("Error: Failed to generate test case") + # exit(1) + + # correct_solution = bootcamp.solver.solve_puzzle(identity["input"]) + # extracted_answer = BBEHHyperbatonbootcamp.extract_output(f"最终答案: {correct_solution}") + # is_correct = BBEHHyperbatonbootcamp._verify_correction(extracted_answer, identity) + # score = bootcamp.verify_score(f"最终答案: {correct_solution}", identity, short_penalty=False, format_penalty=False) + # sys.stdout = original_stdout + + # # 简洁输出 + # print(f"问题: {identity['input']}") + # print(f"答案: {correct_solution}") + # print(f"验证: {'✓' if is_correct else '✗'}") + # print(f"得分: {score}") + + # except Exception as e: + # if 'original_stdout' in locals(): + # sys.stdout = original_stdout + # print(f"错误: {str(e)}") + case = { + "id": 5, + "prompt": "你是一个擅长分析英语形容词顺序的AI助手。请解决以下形容词顺序问题:\n\n问题:\nIn a variant of English, we are given that the following sentences have correct adjective order:\nbrand-new 
good huge computer old-fashioned mysterious enormous square violet box new repulsive tiny knife new ridiculous small pyramidal teal plate old-fashioned repulsive midsize circular shoe old wonderful medium-size canvas brand-new silly huge rectangular knife old silly small watch ancient awful tiny rectangular white bicycle old-fashioned awful little spherical pink bicycle new awful massive book archaic repulsive normal-size prismlike blue flower pot old-fashioned ridiculous large prismlike bottle archaic lovely massive prismlike piano archaic awful small fire extinguisher new ridiculous normal-size prismlike crimson pen brand-new awful large spoon brand-new awful massive bottle old repulsive huge triangular white hammer ancient silly big prismlike maroon lamp new good big saw brand-new wonderful massive spherical hammer old nice enormous bird archaic silly enormous key archaic mysterious large screwdriver ancient wonderful midsize prismlike bicycle old good enormous ruler archaic repulsive medium-size square cyan box new good little prismlike teal canvas archaic nice massive square pink speaker old-fashioned nice midsize banana brand-new nice large pyramidal banana ancient mysterious massive guitar brand-new wonderful massive prismlike calculator ancient awful little bed old repulsive huge square black sunglasses old-fashioned obnoxious little square camera brand-new repulsive little square gray piano old-fashioned awful large triangular gray clock old-fashioned terrible midsize screwdriver brand-new wonderful medium-size rectangular speaker brand-new nice big square indigo box archaic awful massive rectangular magenta pencil brand-new repulsive midsize spherical green fire extinguisher ancient ridiculous big triangular green stapler new good medium-size square wrench archaic good massive prismlike white toolbox ancient wonderful midsize spherical aqua piano archaic repulsive enormous prismlike piano archaic awful massive triangular necklace ancient ridiculous big 
square bicycle old good huge rectangular key\n\nIn this variant of English, which of the following sentences (Options A-J) use the correct adjective order? If none of the sentences (Options A-J) use the correct adjective order, select option K. Select all that apply.\n(A) indigo spherical brand-new lamp (B) archaic ridiculous huge table (C) nice whittling small archaic violet iron square screwdriver (D) old awful big drill (E) square little pink ring (F) old-fashioned mysterious midsize rectangular yellow leather speaker (G) drinking indigo cloth ancient spherical repulsive pen (H) archaic terrible little circular black iron american box (I) ancient lovely normal-size pyramidal cyan rubber chinese hammer (J) old-fashioned awful large prismlike indigo wood congolese stapler\n(K) None of the above\n\nProvide your final answer as a concatenation of all the correct choices. For example, if B and C have correct adjective order, then your final answer must be \"BC\".\n\n请仔细分析示例中的形容词顺序规律,并判断哪些选项符合这个规律。\n\n请按以下格式输出你的答案:\n最终答案: [你的答案]\n\n注意:答案应该是由正确选项的字母组成的字符串(如\"ABC\"),如果没有正确选项则输出\"K\"。", + "output_len": 386, + "score": 0, + "extracted_output": "BDFHIJ", + "ground_truth": { + "input": "In a variant of English, we are given that the following sentences have correct adjective order:\nbrand-new good huge computer old-fashioned mysterious enormous square violet box new repulsive tiny knife new ridiculous small pyramidal teal plate old-fashioned repulsive midsize circular shoe old wonderful medium-size canvas brand-new silly huge rectangular knife old silly small watch ancient awful tiny rectangular white bicycle old-fashioned awful little spherical pink bicycle new awful massive book archaic repulsive normal-size prismlike blue flower pot old-fashioned ridiculous large prismlike bottle archaic lovely massive prismlike piano archaic awful small fire extinguisher new ridiculous normal-size prismlike crimson pen brand-new awful large spoon brand-new awful massive bottle old 
repulsive huge triangular white hammer ancient silly big prismlike maroon lamp new good big saw brand-new wonderful massive spherical hammer old nice enormous bird archaic silly enormous key archaic mysterious large screwdriver ancient wonderful midsize prismlike bicycle old good enormous ruler archaic repulsive medium-size square cyan box new good little prismlike teal canvas archaic nice massive square pink speaker old-fashioned nice midsize banana brand-new nice large pyramidal banana ancient mysterious massive guitar brand-new wonderful massive prismlike calculator ancient awful little bed old repulsive huge square black sunglasses old-fashioned obnoxious little square camera brand-new repulsive little square gray piano old-fashioned awful large triangular gray clock old-fashioned terrible midsize screwdriver brand-new wonderful medium-size rectangular speaker brand-new nice big square indigo box archaic awful massive rectangular magenta pencil brand-new repulsive midsize spherical green fire extinguisher ancient ridiculous big triangular green stapler new good medium-size square wrench archaic good massive prismlike white toolbox ancient wonderful midsize spherical aqua piano archaic repulsive enormous prismlike piano archaic awful massive triangular necklace ancient ridiculous big square bicycle old good huge rectangular key\n\nIn this variant of English, which of the following sentences (Options A-J) use the correct adjective order? If none of the sentences (Options A-J) use the correct adjective order, select option K. 
Select all that apply.\n(A) indigo spherical brand-new lamp (B) archaic ridiculous huge table (C) nice whittling small archaic violet iron square screwdriver (D) old awful big drill (E) square little pink ring (F) old-fashioned mysterious midsize rectangular yellow leather speaker (G) drinking indigo cloth ancient spherical repulsive pen (H) archaic terrible little circular black iron american box (I) ancient lovely normal-size pyramidal cyan rubber chinese hammer (J) old-fashioned awful large prismlike indigo wood congolese stapler\n(K) None of the above\n\nProvide your final answer as a concatenation of all the correct choices. For example, if B and C have correct adjective order, then your final answer must be \"BC\".", + "target": "BDFHIJ" + }, + "output": "To determine which options have the correct adjective order, we need to analyze the given sentences and extract the adjective order rules for this variant of English. Here's the observed order based on the examples:\n\n1. Age/Time (brand-new, new, old, old-fashioned, ancient, archaic)\n2. Opinion (good, repulsive, ridiculous, awful, lovely, mysterious, nice, obnoxious, terrible)\n3. Size (tiny, small, little, midsize, medium-size, normal-size, big, large, huge, enormous, massive)\n4. Shape (square, circular, rectangular, pyramidal, prismlike, spherical, triangular)\n5. Color (violet, teal, crimson, maroon, cyan, pink, indigo, magenta, green, aqua, black, white, gray, yellow)\n6. Material/Origin (canvas, leather, iron, wood, rubber, american, chinese, congolese)\n7. Purpose (drinking, whittling) - appears right before the noun\n\nNow let's evaluate each option:\n\n(A) indigo spherical brand-new lamp \n- Incorrect: \"brand-new\" (age) should come before \"spherical\" (shape) and \"indigo\" (color). 
\nCorrect order: brand-new spherical indigo lamp \n\n(B) archaic ridiculous huge table \n- Correct: archaic (age) → ridiculous (opinion) → huge (size) \n\n(C) nice whittling small archaic violet iron square screwdriver \n- Incorrect: \"archaic\" (age) should come first, \"whittling\" (purpose) should be right before noun. \nCorrect order: archaic nice small square violet iron whittling screwdriver \n\n(D) old awful big drill \n- Correct: old (age) → awful (opinion) → big (size) \n\n(E) square little pink ring \n- Incorrect: \"little\" (size) should come before \"square\" (shape). \nCorrect order: little square pink ring \n\n(F) old-fashioned mysterious midsize rectangular yellow leather speaker \n- Correct: old-fashioned (age) → mysterious (opinion) → midsize (size) → rectangular (shape) → yellow (color) → leather (material) \n\n(G) drinking indigo cloth ancient spherical repulsive pen \n- Incorrect: \"ancient\" (age) should come first, \"drinking\" (purpose) should be right before noun. 
\nCorrect order: ancient repulsive spherical indigo cloth drinking pen \n\n(H) archaic terrible little circular black iron american box \n- Correct: archaic (age) → terrible (opinion) → little (size) → circular (shape) → black (color) → iron (material) → american (origin) \n\n(I) ancient lovely normal-size pyramidal cyan rubber chinese hammer \n- Correct: ancient (age) → lovely (opinion) → normal-size (size) → pyramidal (shape) → cyan (color) → rubber (material) → chinese (origin) \n\n(J) old-fashioned awful large prismlike indigo wood congolese stapler \n- Correct: old-fashioned (age) → awful (opinion) → large (size) → prismlike (shape) → indigo (color) → wood (material) → congolese (origin) \n\nThe correct options are B, D, F, H, I, J.\n\n最终答案: BDFHIJ" +} + bootcamp = BBEHHyperbatonbootcamp + score = bootcamp.verify_score(case["output"], case["ground_truth"], short_penalty=False, format_penalty=False) + print(f"Score: {score}") \ No newline at end of file diff --git a/internbootcamp/bootcamp/symbolic_regression/symbolic_regression.py b/internbootcamp/bootcamp/symbolic_regression/symbolic_regression.py new file mode 100644 index 0000000..4c31c12 --- /dev/null +++ b/internbootcamp/bootcamp/symbolic_regression/symbolic_regression.py @@ -0,0 +1,302 @@ +import re +import json +import requests +from internbootcamp.bootcamp.base import Basebootcamp +from sklearn.metrics import r2_score, root_mean_squared_error +import numpy as np +import sympy as sp +import pickle + + +class SymblocRegression(Basebootcamp): + def __init__(self, data_path): + super().__init__() + self.data_path = data_path + + def case_generator(self, sample_num=300) -> object: + """ + 生成一组数字和目标值。 + """ + + with open(f'{self.data_path}', 'rb') as f: + formula_data = pickle.load(f) + data_list = [] + for i in range(len(formula_data)): + true_formula = formula_data[i]['formula'] + dataset = formula_data[i]['data'] + rand_idx = np.random.choice(dataset.shape[0], sample_num, replace=False) + dataset = 
dataset[rand_idx] + data_list.append({ + 'id': formula_data[i]['id'], + 'true_formula': true_formula, + 'data':dataset, + }) + return data_list + + def prompt_func(self, identity) -> str: + """ + Process the input_data and return the processed prompt. + + Args: + question_ori: The question to be processed. + + Returns: + str: The processed prompt. + """ + data = identity['data'] + length_data = data.shape[0] + split_idx = int(length_data * 0.97) + prompt = f"""You will be provided with a set of input-output pairs. Based on these data, infer the mathematical relationship between y and multiple input variables. Please note that the possible mathematical operations include: +, -, *, /, exp, sqrt, sin, arcsin, and constant terms. The input sample data are as follows: {change_data_to_prompt(data[:split_idx, :])} Based on the above data, please infer the possible formula. Ensure that your inference applies to all the provided data points, and consider both linear and nonlinear combinations. Verify whether your formula applies to the following new data point and adjust it to ensure accuracy: {change_data_to_prompt(data[split_idx:, :])} Finally, please output only the formula string you inferred (e.g. z=x_0 * x_1), without any additional information.""" + return prompt + + @staticmethod + def extract_output(output): + """ + Extract the output from the solution. + + Args: + output: Model output to be processed. + + Returns: + The processed output. + """ + infer_formula = llm_translate(output, mllm='gpt-4o') # gpt-4o Qwen2.5-vl-72b + infer_formula = clean_formula_string(infer_formula) + return infer_formula + + @classmethod + def _verify_correction(self, infer_formula, gt_case, mllm='gpt-4o')->bool: + """ + Verify the correction of the solution. 
+ """ + gt_formula = gt_case['true_formula'] + data = gt_case['data'] + metrics = { + 'LLM_Score': None, + 'RMSE': None, + 'NMSE': None, # 新增:Normalized MSE + 'SymbolicMatch': False, + 'R2': -100000.0, + } + + # 结构评分(用 LLM) + metrics['LLM_Score'] = llm_evaluate(infer_formula, gt_formula, mllm=mllm) + + # 数值拟合 + func_pred, variable_names = parse_formula(infer_formula) + func_gt, variable_names = parse_formula(gt_formula) + var_num = len(variable_names) + x, y_true = data[:, :var_num], data[:, -1] + if func_pred is not None: + try: + x_vars = [x[:, i] for i in range(var_num)] + y_pred = func_pred(*x_vars) + if np.isscalar(y_pred): + y_pred = np.full_like(y_true, y_pred) + #################################################### + valid_mask = np.isfinite(y_true) & np.isfinite(y_pred) + y_true, y_pred = y_true[valid_mask], y_pred[valid_mask] + #################################################### + metrics['RMSE'] = root_mean_squared_error(y_true, y_pred) + metrics['R2'] = r2_score(y_true, y_pred) + metrics['NMSE'] = np.mean((y_true - y_pred) ** 2) / np.var(y_true) + except Exception as e: + print(f"Exception: {e}") + try: + x0_vals, x1_vals = generate_samples() + gt_vals = func_gt(x0_vals, x1_vals) + pred_vals = func_pred(x0_vals, x1_vals) + + # 去除非法值(NaN 或 inf) + valid_mask = np.isfinite(gt_vals) & np.isfinite(pred_vals) + gt_valid = gt_vals[valid_mask] + pred_valid = pred_vals[valid_mask] + + # 计算 RMSE 值 + metrics['RMSE'] = np.sqrt(np.mean((gt_valid - pred_valid) ** 2)) + # 计算 R2 值 + metrics['R2'] = 1 - np.sum((gt_valid - pred_valid) ** 2) / np.var(gt_valid) + metrics['NMSE'] = np.mean((gt_valid - pred_valid) ** 2) / np.var(gt_valid) + except Exception as e: + print(e) + # 判断方程等价性 + metrics['SymbolicMatch'] = is_symbolically_equivalent(infer_formula, gt_formula, var_num) + + return metrics + + +def _send_request(messages, mllm='gpt-4o'): + URL = f"" # TODO your API URL + API_KEY = "" # TODO your API key + if URL is None or API_KEY is None: + raise ValueError("Please 
provide your API URL or API key.") + HEADERS = { + 'Accept': 'application/json', + 'Authorization': f'Bearer {API_KEY}', + 'User-Agent': 'Apifox/1.0.0 (https://apifox.com)', + 'Content-Type': 'application/json' + } + MLLM_claudeshop = { + 'gpt-4o': 'chatgpt-4o-latest', + } + model = MLLM_claudeshop[mllm] + count = 0 + while True and count < 20: + count += 1 + payload = json.dumps({ + "model": model, + "messages": messages, + "temperature": 0.6, + "max_tokens": 1024 + }) + session = requests.Session() + session.keep_alive = False + response = session.post(URL, headers=HEADERS, data=payload, verify=True) + try: + content = response.json()['choices'][0]['message']['content'] + break + except Exception as e: + print(f"Error: {e}, {response.json()}") + pass + return content + + +def clean_formula_string(formula_str): + # 1. 删除 Markdown 残留符号 + formula_str = formula_str.replace('×', '*').replace('·', '*').replace('÷', '/') + formula_str = formula_str.replace('−', '-').replace('^', '**') + formula_str = formula_str.replace('“', '"').replace('”', '"').replace('’', "'") + + # 2. 去除 markdown 反引号 ``` 和 $ 符号 + formula_str = formula_str.replace('`', '').replace('$', '').strip() + + # 3. 提取第一行公式(防止有多行解释性输出) + formula_str = formula_str.split('\n')[0].strip() + + # 4. 用正则去除非合法字符(保留基本数学表达式) + formula_str = re.sub(r'[^\w\s\+\-\*/\^\=\.\(\)]', '', formula_str) + + # 5. 确保左右去空格 + return formula_str.strip() + + +def llm_translate(dirty_formula, mllm='gpt-4o'): + content = f''' + This is a language model's judgment on a mathematical formula. Please help me extract the mathematical formula from this judgment and return it: + {dirty_formula} + Please serve pi as pi and use x0, x1, x2,... to represent the variable names. + ONLY RETURN THE FORMULA STRING (Not LATEX). 
+ ''' + messages = [{"role": "user", "content": content}] + clean_formula = _send_request(messages, mllm=mllm) + return clean_formula + + +def llm_evaluate(inferred_formula, true_formula, mllm='gpt-4o'): + content = f''' + You are given two mathematical formulas. Your task is to evaluate how structurally similar they are, and return a similarity score between 0 and 1. + + The score should reflect how closely the formulas match in terms of: + - Mathematical operations and structure (e.g., same use of +, *, sin, etc.) + - Term arrangement and complexity + - Overall symbolic expression and intent + + A score of: + - 1 means the formulas are structurally identical or mathematically equivalent + - Around 0.8-0.9 means they are very similar but not identical + - Around 0.5 means moderately similar (e.g., same overall shape but different terms) + - Near 0 means structurally unrelated formulas + + Do not consider numerical evaluation or specific input values — only the symbolic structure and mathematical form. 
+
+    Formulas:
+    Inferred Formula: {inferred_formula}
+    True Formula: {true_formula}
+
+    ONLY RETURN [THE SIMILARITY SCORE]
+    '''
+    messages = [{"role": "user", "content": content}]
+    similarity_score = _send_request(messages, mllm=mllm)
+    return similarity_score[-4:]  # the score is expected at the end of the response
+
+
+def is_symbolically_equivalent(formula1, formula2, n_var=2):
+    try:
+        x = [sp.Symbol(f'x{i}') for i in range(n_var)]
+
+        expr1 = sp.sympify(formula1.split('=')[1] if '=' in formula1 else formula1)
+        expr2 = sp.sympify(formula2.split('=')[1] if '=' in formula2 else formula2)
+
+        return sp.simplify(expr1 - expr2) == 0
+    except Exception:
+        return False
+
+
+def parse_formula(formula_str: str):
+    try:
+        if '=' in formula_str:
+            expr_str = formula_str.split('=', 1)[1].strip()
+        else:
+            expr_str = formula_str.strip()
+
+        if not expr_str:
+            print(f"[Parse Error] formula string is empty (or empty after stripping): '{formula_str}'")
+            return None
+
+        local_dict = {"sin": sp.sin, "cos": sp.cos, "exp": sp.exp, "sqrt": sp.sqrt, "log": sp.log,
+                      "arccos": sp.acos, "arcsin": sp.asin, "tan": sp.tan, "pi": sp.pi}
+        expr = sp.sympify(expr_str, locals=local_dict)
+        # Collect the free variables to build the function's domain
+        variable_names = sorted([str(sym) for sym in expr.free_symbols])
+        symbols = [sp.Symbol(name) for name in variable_names]
+        for sym in symbols:
+            local_dict[str(sym)] = sym
+        # Convert to a NumPy-callable expression
+        numpy_modules = ['numpy', {'sqrt': np.sqrt, 'exp': np.exp, 'sin': np.sin, 'cos': np.cos, 'log': np.log,
+                                   'arcsin': np.arcsin, 'arccos': np.arccos, 'tan': np.tan, 'pi': np.pi}]
+        func = sp.lambdify(symbols, expr, modules=numpy_modules)
+        return func, variable_names
+    except (SyntaxError, TypeError, AttributeError, sp.SympifyError) as e:
+        print(f'[Parse Error] failed to parse formula "{formula_str}": {e}')
+        return None
+    except Exception as e:
+        print(f'[Parse Error] unexpected error while parsing formula "{formula_str}": {e}')
+        return None
+
+
+def generate_samples(x0_range=(-10, 10), x1_range=(-10, 10), num_points=1000):
+    """
+    Return sample points (x0, x1) inside the given domain.
+    """
+    x0_range = np.linspace(x0_range[0], x0_range[1], num_points)
+    x1_range = np.linspace(x1_range[0], x1_range[1], num_points)
+    x0, x1 = np.meshgrid(x0_range, x1_range)
+    x0_vals = x0.flatten()
+    x1_vals = x1.flatten()
+    return x0_vals, x1_vals
+
+
+def change_data_to_prompt(points):
+    data_prompt = ""
+    for i in range(points.shape[0]):  # adjust this branch list if the number of variables changes
+        if points.shape[1] == 2:
+            data_prompt += f"""x0={points[i, 0]:.5f}, y={points[i, 1]:.5f}\n"""
+        elif points.shape[1] == 3:
+            data_prompt += f"""x0={points[i, 0]:.5f}, x1={points[i, 1]:.5f}, y={points[i, 2]:.5f}\n"""
+        elif points.shape[1] == 4:
+            data_prompt += f"""x0={points[i, 0]:.5f}, x1={points[i, 1]:.5f}, x2={points[i, 2]:.5f}, y={points[i, 3]:.5f}\n"""
+        elif points.shape[1] == 5:
+            data_prompt += f"""x0={points[i, 0]:.5f}, x1={points[i, 1]:.5f}, x2={points[i, 2]:.5f}, x3={points[i, 3]:.5f}, y={points[i, 4]:.5f}\n"""
+    return data_prompt
+
+
+if __name__ == '__main__':
+    # example
+    data_path = 'test_data.pkl'
+    bootcamp = SymblocRegression(data_path)
+    case = bootcamp.case_generator()[0]  # pick one case
+    print(bootcamp.prompt_func(case))
+    example_answer = "y = x0 * x1"
+    print(f"answer: {example_answer}")
+    metrics = bootcamp._verify_correction(example_answer, case)
+    # Avoid nested single quotes inside the f-string (a SyntaxError before Python 3.12)
+    gt = case["true_formula"]
+    print(f'GT: {gt.ljust(40)} | Pred: {example_answer.ljust(40)} | Score: {metrics["LLM_Score"]} | RMSE: {metrics["RMSE"]} | NMSE: {metrics["NMSE"]} | R2: {metrics["R2"]} | Match: {metrics["SymbolicMatch"]}')
diff --git a/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_generator.py b/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_generator.py
new file mode 100644
index 0000000..227eb71
--- /dev/null
+++ b/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_generator.py
@@ -0,0 +1,307 @@
+import json
+import random
+from typing import List, Dict, Any, Tuple, Set
+import logging
+
+logger = logging.getLogger(__name__)
+logging.basicConfig(level=logging.INFO)
+
+class HyperbatonGenerator:
+    def __init__(self):
+        # Adjective categories (listed in the standard order)
+        self.adjective_types = [
+            "age",
+            "quality",
+            "size",
+            "shape",
+            "color",
+            "material",
+            "nationality",
+            "activity"
+        ]
+        # The standard order (kept in sync with adjective_types)
+        self.standard_order = self.adjective_types.copy()
+        # Candidate adjectives for each category
+        self.adjectives = {
+            "color": ["red", "blue", "green", "yellow", "black", "white", "cyan", "magenta",
+                      "violet", "brown", "gray", "pink", "crimson", "indigo", "maroon",
+                      "teal", "aqua"],
+            "size": ["big", "small", "tiny", "enormous", "huge", "massive", "large", "little",
+                     "medium-size", "midsize", "normal-size"],
+            "shape": ["square", "circular", "triangular", "rectangular", "spherical", "pyramidal",
+                      "prismlike"],
+            "age": ["old", "new", "ancient", "old-fashioned", "brand-new", "archaic"],
+            "material": ["wood", "plastic", "steel", "iron", "glass", "paper", "cloth",
+                         "cardboard", "rubber", "leather", "wool", "fiberglass"],
+            "nationality": ["chinese", "american", "russian", "japanese", "vietnamese", "iranian",
+                            "turkish", "mexican", "brazilian", "german", "filipino", "ethiopian",
+                            "indian", "egyptian", "nigerian", "thai", "indonesian", "congolese",
+                            "bangladeshi", "pakistani"],
+            "activity": ["smoking", "hiking", "driving", "walking", "eating", "drinking",
+                         "typing", "whittling", "snorkeling", "exercise"],
+            "quality": ["good", "terrible", "lovely", "awful", "nice", "wonderful", "repulsive",
+                        "obnoxious", "mysterious", "ridiculous", "silly"]
+        }
+
+        # Noun list (duplicate entries removed so sampling stays uniform)
+        self.nouns = [
+            "ball", "box", "bag", "chair", "table", "car", "book", "pen", "pencil", "bottle",
+            "cup", "knife", "fork", "spoon", "plate", "flower pot", "hat", "shirt", "shoe",
+            "guitar", "piano", "clock", "watch", "camera", "phone", "computer", "sofa", "bed",
+            "lamp", "umbrella", "car key", "wallet", "necklace", "ring", "earring", "brush",
+            "scissors", "candle", "vase", "banana", "apple", "dog", "bird", "hammer", "drill",
+            "screwdriver", "plier", "stapler", "ruler", "calculator", "saw", "trash can",
+            "fire extinguisher", "wrench", "bicycle", "speaker", "marker", "toolbox", "jar",
+            "bowl", "sunglasses", "canvas", "key", "house", "tool"
+        ]
+        # Finally, randomize the category order (used for partial-order rules)
+        self.randomized_order = random.sample(self.adjective_types, len(self.adjective_types))
+        self.partial_order_pairs = self.generate_partial_order_pairs()
+
+        # Set the random seed (note: seeding with a random value keeps runs non-deterministic)
+        random.seed(random.randint(1, 10000))
+
+    def generate_partial_order_pairs(self) -> Set[Tuple[str, str]]:
+        """Generate the partial-order rule pairs."""
+        pairs = set()
+        # Ensure at least N direct ordering relations
+        min_pairs = len(self.adjective_types) + 2
+
+        while len(pairs) < min_pairs:
+            idx1, idx2 = random.sample(range(len(self.randomized_order)), 2)
+            if idx1 > idx2:
+                idx1, idx2 = idx2, idx1
+            pairs.add((self.randomized_order[idx1], self.randomized_order[idx2]))
+
+        return pairs
+
+    def generate_example_sentence(self, noun: str, num_adjectives: int, use_partial: bool = False) -> str:
+        """Generate an example sentence with extra randomness."""
+        try:
+            if not hasattr(self, 'standard_order'):
+                raise AttributeError("standard_order attribute not initialized")
+
+            if use_partial:
+                # Pick a random subset of categories, but keep them in the standard
+                # order so the example still demonstrates a correct (partial) ordering
+                chosen = random.sample(self.standard_order, min(num_adjectives, len(self.standard_order)))
+                available_types = [t for t in self.standard_order if t in chosen]
+            else:
+                # Full order, truncated to a random prefix of at least num_adjectives
+                end_idx = random.randint(num_adjectives, len(self.standard_order))
+                available_types = self.standard_order[:end_idx]
+
+            # Pick a random adjective for each selected category
+            adjectives = []
+            for adj_type in available_types:
+                if self.adjectives[adj_type]:
+                    adj = random.choice(self.adjectives[adj_type])
+                    adjectives.append(adj)
+
+            # Keep only the requested number of adjectives
+            adjectives = adjectives[:num_adjectives]
+
+            return " ".join(adjectives) + " " + noun
+        except Exception as e:
+            logger.error(f"Error generating example sentence: {str(e)}")
+            raise
+
+    def generate_example_sentences(self, num_sentences: int = 100) -> List[str]:
+        """Generate example sentences, covering both full-order and partial-order cases."""
+        sentences = []
+        # Make sure enough examples are generated to exhibit the full ordering rules
+        min_full_order = 30     # at least 30 full-order examples
+        min_partial_order = 50  # at least 50 partial-order examples
+
+        # Full-order examples
+        for i in range(min_full_order):
+            noun = random.choice(self.nouns)
+            num_adjectives = random.randint(3, 5)  # use more adjectives
+            sentences.append(self.generate_example_sentence(noun, num_adjectives, use_partial=False))
+
+        # Partial-order examples
+        for i in range(min_partial_order):
+            noun = random.choice(self.nouns)
+            num_adjectives = random.randint(2, 4)
+            sentences.append(self.generate_example_sentence(noun, num_adjectives, use_partial=True))
+
+        # Fill the rest with random examples
+        remaining = num_sentences - len(sentences)
+        for i in range(remaining):
+            noun = random.choice(self.nouns)
+            num_adjectives = random.randint(1, 3)
+            use_partial = random.choice([True, False])
+            sentences.append(self.generate_example_sentence(noun, num_adjectives, use_partial))
+
+        # Shuffle first, then number, so the "(n)" labels stay sequential
+        random.shuffle(sentences)
+        return [f"({i + 1}) {s}" for i, s in enumerate(sentences)]
+
+    def generate_test_sentence(self, correct: bool = True) -> str:
+        """Generate a test sentence with extra randomness."""
+        noun = random.choice(self.nouns)
+        num_adjectives = random.randint(3, 7)
+
+        if correct:
+            # Correct sentences follow the standard order
+            selected_types = self.standard_order[:num_adjectives]
+        else:
+            # Incorrect sentences must violate the standard *relative* order;
+            # merely differing from the standard prefix is not enough, because any
+            # strictly increasing category sequence still counts as a correct order
+            selected_types = random.sample(self.standard_order, num_adjectives)
+            while [t for t in self.standard_order if t in selected_types] == selected_types:
+                random.shuffle(selected_types)
+
+        # Pick a random adjective for each selected category
+        adjectives = []
+        for adj_type in selected_types:
+            if self.adjectives[adj_type]:
+                adj = random.choice(self.adjectives[adj_type])
+                adjectives.append(adj)
+
+        return " ".join(adjectives) + " " + noun
+
+    def generate_options(self, num_options: int = 10, num_correct: int = None) -> Tuple[List[str], List[bool]]:
+        if num_correct is None:
+            num_correct = random.randint(3, 6)  # range for the number of correct options
+
+        options = []
+        is_correct = []
+
+        # Generate the correct options. (The original in-loop "CEJ" check was a
+        # no-op because `options`/`is_correct` are still empty at this point; the
+        # "CEJ" combination is filtered in generate_task by regenerating the task.)
+        correct_sentences = set()
+        while len(correct_sentences) < num_correct:
+            correct_sentences.add(self.generate_test_sentence(correct=True))
+
+        options.extend(correct_sentences)
+        is_correct.extend([True] * num_correct)
+
+        # Generate the incorrect options
+        incorrect_sentences = set()
+        while len(incorrect_sentences) < (num_options - num_correct):
+            incorrect_sentences.add(self.generate_test_sentence(correct=False))
+
+        options.extend(incorrect_sentences)
+        is_correct.extend([False] * (num_options - num_correct))
+
+        # Shuffle the options
+        combined = list(zip(options, is_correct))
+        random.shuffle(combined)
+        options, is_correct = zip(*combined)
+
+        return list(options), list(is_correct)
+
+    def format_options(self, options: List[str]) -> List[str]:
+        """Format options as (A), (B), etc."""
+        formatted_options = []
+        for i, option in enumerate(options):
+            letter = chr(65 + i)  # A, B, C, ...
+            formatted_options.append(f"({letter}) {option}")
+        return formatted_options
+
+    def get_answer_string(self, is_correct: List[bool]) -> str:
+        """Convert the correct-option flags into an answer string."""
+        correct_letters = []
+        for i, correct in enumerate(is_correct):
+            if correct:
+                letter = chr(65 + i)
+                correct_letters.append(letter)
+
+        return "K" if not correct_letters else "".join(correct_letters)
+
+    def generate_task(self) -> Dict[str, Any]:
+        while True:
+            # Generate example sentences
+            example_sentences = []
+            while len(example_sentences) < random.randint(50, 180):
+                sentence = self.generate_example_sentence(
+                    random.choice(self.nouns),
+                    random.randint(3, 5)
+                )
+                if sentence not in example_sentences:
+                    example_sentences.append(sentence)
+
+            # Generate the test options
+            options, is_correct = self.generate_options(10)
+            formatted_options = self.format_options(options)
+
+            # Make sure the options do not duplicate any example sentence
+            all_sentences = example_sentences + [opt.split(") ")[1] for opt in formatted_options]
+
+            # Get the answer string
+            answer = self.get_answer_string(is_correct)
+
+            # Check for the CEJ combination
+            if "CEJ" in answer:
+                continue  # regenerate if the answer contains CEJ
+
+            if len(all_sentences) == len(set(all_sentences)):
+                # Build the task input
+                task_input = (
+                    f"In a variant of English, we are given that the following sentences have correct adjective order:\n"
+                    f"{' '.join(example_sentences)}\n\n"
+                    f"In this variant of English, which of the following sentences (Options A-J) use the correct adjective order? "
+                    f"If none of the sentences (Options A-J) use the correct adjective order, select option K. Select all that apply.\n"
+                    f"{' '.join(formatted_options)}\n"
+                    f"(K) None of the above\n\n"
+                    f"Provide your final answer as a concatenation of all the correct choices. For example, if B and C have correct adjective order, "
+                    f"then your final answer must be \"BC\"."
+                )
+
+                return {
+                    "input": task_input,
+                    "target": answer
+                }
+
+    def validate_sentence(self, sentence: str) -> bool:
+        """Check whether a generated sentence follows the rules."""
+        words = sentence.split()
+        if len(words) < 2:  # need at least one adjective and one noun
+            return False
+
+        # Check that the noun is in the allowed list
+        if words[-1] not in self.nouns:
+            return False
+
+        # Check that every adjective is known
+        adjectives = words[:-1]
+        for adj in adjectives:
+            found = False
+            for adj_list in self.adjectives.values():
+                if adj in adj_list:
+                    found = True
+                    break
+            if not found:
+                return False
+
+        return True
+
+    def generate_task_batch(self, batch_size: int = 1) -> List[Dict[str, Any]]:
+        """Generate a batch of tasks directly, without saving to a file."""
+        tasks = []
+        for _ in range(batch_size):
+            task = self.generate_task()
+            tasks.append(task)
+        return tasks
+
+
+if __name__ == "__main__":
+    try:
+        generator = HyperbatonGenerator()
+        # Test generating a single sentence
+        test_sentence = generator.generate_example_sentence("ball", 3, False)
+        print(f"Test sentence: {test_sentence}")
+
+        # Generate the full dataset (the class has no save_dataset method,
+        # so write the generated batch out directly)
+        tasks = generator.generate_task_batch(100)
+        with open("hyperbaton_dataset.json", "w", encoding="utf-8") as f:
+            json.dump({"examples": tasks}, f, ensure_ascii=False, indent=2)
+        print("Dataset generated and saved to hyperbaton_dataset.json")
+    except Exception as e:
+        print(f"Error: {str(e)}")
diff --git a/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_solver.py b/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_solver.py
new file mode 100644
index 0000000..6f5b0fe
--- /dev/null
+++ b/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_solver.py
@@ -0,0 +1,218 @@
+import json
+from typing import List, Dict, Set, Tuple, Any
+import re
+
+
+class HyperbatonSolver:
+    def __init__(self):
+        # Standard adjective order
+        self.adjective_types = [
+            "age",
+            "quality",
+            "size",
+            "shape",
+            "color",
+            "material",
+            "nationality",
+            "activity"
+        ]
+
+        # Full adjective lexicon; must contain every adjective the generator can emit
+        self.adjectives = {
+            "age": {
+                "old", "new", "ancient", "old-fashioned", "brand-new", "archaic"
+            },
+            "quality": {
+                "good", "terrible", "lovely", "awful", "nice", "wonderful",
+                "repulsive", "obnoxious", "mysterious", "ridiculous", "silly"
+            },
+            "size": {
+                "big", "small", "tiny", "enormous", "huge", "massive", "large",
+                "little", "medium-size", "midsize", "normal-size"
+            },
+            "shape": {
+                "square", "circular", "triangular", "rectangular", "spherical",
+                "pyramidal", "prismlike"
+            },
+            "color": {
+                "red", "blue", "green", "yellow", "black", "white", "cyan",
+                "magenta", "violet", "brown", "gray", "pink", "crimson",
+                "indigo", "maroon", "teal", "aqua"
+            },
+            "material": {
+                "wood", "plastic", "steel", "iron", "glass", "paper", "cloth",
+                "cardboard", "rubber", "leather", "wool", "fiberglass"
+            },
+            "nationality": {
+                "chinese", "american", "russian", "japanese", "vietnamese",
+                "iranian", "turkish", "mexican", "brazilian", "german",
+                "filipino", "ethiopian", "indian", "egyptian", "nigerian",
+                "thai", "indonesian", "congolese", "bangladeshi", "pakistani"
+            },
+            "activity": {
+                "smoking", "hiking", "driving", "walking", "eating", "drinking",
+                "typing", "whittling", "snorkeling", "exercise"
+            }
+        }
+
+        # Build the reverse lookup table
+        self.word_to_type = {}
+        for type_name, words in self.adjectives.items():
+            for word in words:
+                self.word_to_type[word] = type_name
+
+    def extract_examples(self, text: str) -> List[str]:
+        """Extract the example sentences."""
+        # Find the start of the examples section. Note: generate_task joins the
+        # examples with spaces, so split('\n') may yield a single chunk; the
+        # solver does not rely on the examples for validation.
+        start = text.find("correct adjective order:\n") + len("correct adjective order:\n")
+        end = text.find("\n\nIn this variant")
+        examples_text = text[start:end].strip()
+        return [s.strip() for s in examples_text.split('\n') if s.strip()]
+
+    def extract_options(self, text: str) -> List[Tuple[str, str]]:
+        """Extract the options."""
+        options = []
+        # Match options with a regular expression
+        pattern = r'\(([A-K])\) (.*?)(?=\([A-K]\)|$)'
+        matches = re.finditer(pattern, text, re.DOTALL)
+        for match in matches:
+            label = match.group(1)
+            sentence = match.group(2).strip()
+            options.append((label, sentence))
+        return options
+
+    def get_adjective_sequence(self, words: List[str]) -> List[str]:
+        """Return the sequence of adjective categories in a word list."""
+        sequence = []
+        for word in words[:-1]:  # skip the final noun
+            word_type = self.word_to_type.get(word.lower())
+            if word_type:
+                sequence.append(word_type)
+        return sequence
+
+    def is_valid_adjective_order(self, words: List[str]) -> bool:
+        """Check whether the adjective order is correct."""
+        sequence = self.get_adjective_sequence(words)
+        if not sequence:
+            return False
+
+        # Check the order of consecutive adjectives
+        last_seen_index = -1
+        for adj_type in sequence:
+            try:
+                current_index = self.adjective_types.index(adj_type)
+                # If the current index is not larger than the previous one, the order is wrong
+                if current_index <= last_seen_index:
+                    return False
+                last_seen_index = current_index
+            except ValueError:
+                # The adjective category is not in the predefined list
+                return False
+        return True
+
+    def parse_puzzle(self, puzzle: str) -> Tuple[str, List[Tuple[str, str]]]:
+        """Parse the puzzle text; return the examples text and the option list."""
+        # Extract the examples section
+        examples_text = self.extract_examples(puzzle)
+
+        # Extract the options
+        options = self.extract_options(puzzle)
+
+        return examples_text, options
+
+    def learn_from_examples(self, examples: List[str]) -> List[List[str]]:
+        """Learn the correct adjective sequences from the examples."""
+        sequences = []
+        for example in examples:
+            words = example.split()
+            sequence = self.get_adjective_sequence(words)
+            if sequence:
+                sequences.append(sequence)
+        return sequences
+
+    def get_word_type(self, word: str) -> str:
+        """Return the adjective category of a word."""
+        # Lowercase to ensure a match
+        word = word.lower()
+
+        # Look the type up directly in word_to_type
+        return self.word_to_type.get(word)
+
+    def solve_puzzle(self, puzzle: str) -> str:
+        examples_text, options = self.parse_puzzle(puzzle)
+
+        # Find every valid option
+        valid_options = []
+        for label, sentence in options:
+            if label == 'K':
+                continue
+            words = sentence.split()
+            if self.is_valid_adjective_order(words):
+                valid_options.append(label)
+
+        # If there is no valid option, return "K"
+        if not valid_options:
+            return "K"
+
+        # Check whether the answer would contain the CEJ combination
+        answer = ''.join(sorted(valid_options))
+        if "CEJ" in answer:
+            # Drop one option to avoid the CEJ combination
+            if 'C' in valid_options and 'E' in valid_options and 'J' in valid_options:
+                valid_options.remove('J')  # or drop another of the three
+
+        return ''.join(sorted(valid_options))
+
+    def solve_batch(self, tasks: List[Dict[str, Any]]) -> List[str]:
+        """Solve a batch of tasks directly."""
+        results = []
+        for task in tasks:
+            solved = self.solve_puzzle(task['input'])
+            results.append(solved)
+        return results
+
+
+def debug_adjective_order(solver: HyperbatonSolver, sentence: str):
+    """Debug helper: show the category of each adjective in a sentence."""
+    words = sentence.split()
+    sequence = []
+    for word in words[:-1]:
+        word_type = solver.word_to_type.get(word.lower())
+        sequence.append((word, word_type))
+    return sequence
+
+
+def main():
+    solver = HyperbatonSolver()
+    # The solver has no solve_dataset method; load the dataset and use solve_batch
+    with open('hyperbaton_dataset.json', 'r', encoding='utf-8') as f:
+        dataset = json.load(f)
+    tasks = dataset['examples']
+    solutions = solver.solve_batch(tasks)
+    results = [(solved, task['target']) for solved, task in zip(solutions, tasks)]
+
+    print("Solver results:")
+    correct = 0
+    total = len(results)
+
+    # Print the result for each puzzle
+    for i, (solved, expected) in enumerate(results, 1):
+        is_correct = solved == expected
+        status = "✓" if is_correct else "✗"
+        if is_correct:
+            correct += 1
+        print(f"Puzzle {i}: expected={expected}, solved={solved} {status}")
+
+        # Print debug information for wrong answers
+        if not is_correct:
+            print("Debug info:")
+            puzzle = tasks[i - 1]
+            options = solver.extract_options(puzzle['input'])
+            for label, sentence in options:
+                if label in expected:
+                    print(f"Option {label} should be correct: {sentence}")
+                    print("Adjective sequence:", debug_adjective_order(solver, sentence))
+
+    accuracy = (correct / total) * 100
+    print(f"\nAccuracy: {accuracy:.2f}% ({correct}/{total})")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_validor.py b/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_validor.py
new file mode 100644
index 0000000..f9afe4e
--- /dev/null
+++ b/internbootcamp/libs/bbeh_hyperbaton/bbeh_hyperbaton_validor.py
@@ -0,0 +1,158 @@
+import json
+from typing import Dict, List, Optional, Any
+
+
+class HyperbatonValidator:
+    def __init__(self, dataset_file: str = None):
+        """Optionally load a dataset. validate_batch does not need one, but
+        validate_predictions / validate_single_prediction / get_example do."""
+        self.dataset = None
+        if dataset_file:
+            with open(dataset_file, 'r', encoding='utf-8') as f:
+                self.dataset = json.load(f)
+
+    def validate_batch(self, tasks: List[Dict[str, Any]], predictions: List[str]) -> Dict[str, Any]:
+        """Validate a batch of predictions directly."""
+        if len(predictions) != len(tasks):
+            raise ValueError("Number of predictions does not match number of tasks")
+
+        results = []
+        correct_count = 0
+        total_count = len(predictions)
+
+        for i, (task, pred) in enumerate(zip(tasks, predictions)):
+            expected = task['target']
+            cleaned_pred = ''.join(c for c in pred if c in "ABCDEFGHIJK")
+            is_correct = cleaned_pred == expected
+
+            if is_correct:
+                correct_count += 1
+
+            results.append({
+                "task_id": i,
+                "prediction": cleaned_pred,
+                "target": expected,
+                "is_correct": is_correct
+            })
+
+        return {
+            "accuracy": correct_count / total_count,
+            "correct_count": correct_count,
+            "total_count": total_count,
+            "detailed_results": results
+        }
+
+    def validate_predictions(self, predictions_file: str) -> Dict[str, float]:
+        """Validate the accuracy of saved predictions against the loaded dataset."""
+        # Load the predictions
+        with open(predictions_file, 'r', encoding='utf-8') as f:
+            predictions = json.load(f)
+
+        # Make sure the predictions are in the expected format
+        if not isinstance(predictions, list):
+            raise ValueError("Predictions must be a list")
+
+        if len(predictions) != len(self.dataset["examples"]):
+            raise ValueError(f"Number of predictions ({len(predictions)}) does not match "
+                             f"number of dataset examples ({len(self.dataset['examples'])})")
+
+        # Validate each prediction
+        correct_count = 0
+        total_count = len(predictions)
+
+        results = []
+
+        for i, (pred, example) in enumerate(zip(predictions, self.dataset["examples"])):
+            expected = example["target"]
+
+            # Clean the prediction: keep only the uppercase option letters
+            cleaned_pred = ''.join(c for c in pred if c in "ABCDEFGHIJK")
+
+            # Check whether the prediction is correct
+            is_correct = cleaned_pred == expected
+
+            if is_correct:
+                correct_count += 1
+
+            results.append({
+                "example_id": i,
+                "prediction": cleaned_pred,
+                "target": expected,
+                "is_correct": is_correct
+            })
+
+        # Compute the overall accuracy
+        accuracy = correct_count / total_count if total_count > 0 else 0
+
+        # Return the summary statistics
+        return {
+            "accuracy": accuracy,
+            "correct_count": correct_count,
+            "total_count": total_count,
+            "detailed_results": results
+        }
+
+    def validate_single_prediction(self, example_id: int, prediction: str) -> Dict[str, Any]:
+        """Validate a single prediction."""
+        if example_id < 0 or example_id >= len(self.dataset["examples"]):
+            raise ValueError(f"Example ID {example_id} out of range [0, {len(self.dataset['examples']) - 1}]")
+
+        example = self.dataset["examples"][example_id]
+        expected = example["target"]
+
+        # Clean the prediction: keep only the uppercase option letters
+        cleaned_pred = ''.join(c for c in prediction if c in "ABCDEFGHIJK")
+
+        # Check whether the prediction is correct
+        is_correct = cleaned_pred == expected
+
+        return {
+            "example_id": example_id,
+            "prediction": cleaned_pred,
+            "target": expected,
+            "is_correct": is_correct
+        }
+
+    def get_example(self, example_id: int) -> Optional[Dict[str, str]]:
+        """Return the example with the given ID, or None."""
+        if example_id < 0 or example_id >= len(self.dataset["examples"]):
+            return None
+
+        return self.dataset["examples"][example_id]
+
+    def generate_report(self, results: Dict[str, Any], output_file: str = None):
+        """Generate a validation report in Markdown."""
+        report = [
+            "# BBEH Hyperbaton Task Validation Report",
+            "",
+            f"Overall accuracy: {results['accuracy']:.2%} ({results['correct_count']}/{results['total_count']})",
+            "",
+            "## Detailed Results",
+            "",
+            "| Example ID | Prediction | Target | Correct? |",
+            "| ---------- | ---------- | ------ | -------- |",
+        ]
+
+        for result in results["detailed_results"]:
+            correct_mark = "✓" if result["is_correct"] else "✗"
+            report.append(f"| {result['example_id']} | {result['prediction']} | {result['target']} | {correct_mark} |")
+
+        report_text = "\n".join(report)
+
+        if output_file:
+            with open(output_file, 'w', encoding='utf-8') as f:
+                f.write(report_text)
+
+        return report_text
+
+
+# Usage example
+if __name__ == "__main__":
+    validator = HyperbatonValidator("hyperbaton_dataset.json")
+
+    # Suppose we already have a predictions file
+    try:
+        results = validator.validate_predictions("predictions.json")
+        report = validator.generate_report(results, "validation_report.md")
+        print("Validation report generated and saved to validation_report.md")
+    except FileNotFoundError:
+        print("Predictions file not found; generate predictions first")
+
+    # As a demo, validate a single prediction
+    single_result = validator.validate_single_prediction(0, "E")
+    print(f"Single validation result: {single_result}")
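
The correctness check at the heart of the solver reduces to one invariant: a noun phrase is valid exactly when the category indices of its adjectives are strictly increasing in the standard order (`age < quality < size < shape < color < material < nationality < activity`). A standalone sketch of that check, using only a trimmed, illustrative subset of the adjective lexicon from the generator:

```python
# Standard category order, as in HyperbatonSolver.adjective_types.
STANDARD_ORDER = ["age", "quality", "size", "shape",
                  "color", "material", "nationality", "activity"]

# Minimal subset of the generator's lexicon (illustrative only).
WORD_TO_TYPE = {
    "old": "age", "ancient": "age",
    "nice": "quality", "terrible": "quality",
    "big": "size", "tiny": "size",
    "square": "shape",
    "red": "color", "blue": "color",
    "wood": "material", "steel": "material",
    "german": "nationality",
    "hiking": "activity",
}

def is_valid_adjective_order(words):
    """Return True iff the adjectives (every word except the final noun)
    appear at strictly increasing positions in STANDARD_ORDER."""
    last_index = -1
    saw_adjective = False
    for word in words[:-1]:
        word_type = WORD_TO_TYPE.get(word.lower())
        if word_type is None:
            return False  # unknown adjective
        current = STANDARD_ORDER.index(word_type)
        if current <= last_index:
            return False  # out of order, or a repeated category
        last_index = current
        saw_adjective = True
    return saw_adjective

print(is_valid_adjective_order(["old", "big", "red", "ball"]))  # age < size < color -> True
print(is_valid_adjective_order(["red", "big", "ball"]))         # color before size -> False
```

Note that skipping categories is allowed (`old red ball` is valid even though quality, size, and shape are absent), which is why an "incorrect" test sentence must actually violate the relative order rather than merely differ from the standard prefix.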