update training dir with external eval details (#437)

* added games

* added llama 3b training conf

* update readme with details of external evals

* readme update

---------

Co-authored-by: joesharratt1229 <joesharratt1229@gmail.com>
Oliver Stanley 2025-05-18 23:35:41 +01:00 committed by GitHub
parent 5961a10145
commit add527ada1
5 changed files with 374 additions and 0 deletions

@@ -0,0 +1,26 @@
task: llama_math_algebra
dataset_path: EleutherAI/hendrycks_math
process_docs: !function utils.process_docs
dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI Assistant that provides well-reasoned and detailed responses.\nYou first think about the reasoning process as an internal monologue and then provide the user with the answer.\nRespond in the following format:\n<think>\n...\n</think>\n<answer>\n...\n</answer><|eot_id|><|start_header_id|>user<|end_header_id|>\n\nSolve the following math problem efficiently and clearly:\n\n- For simple problems (2 steps or fewer):\nProvide a concise solution with minimal explanation.\n\n- For complex problems (3 steps or more):\nUse this step-by-step format:\n\n## Step 1: [Concise description]\n[Brief explanation and calculations]\n\n## Step 2: [Concise description]\n[Brief explanation and calculations]\n\n...\n\nRegardless of the approach, always conclude with:\n\nTherefore, the final answer is: $\\\\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.\n\nProblem: {{ problem }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
process_results: !function utils.process_results
doc_to_target: "{{answer if few_shot is undefined else solution}}"
generation_kwargs:
  until:
    - "Problem:"
    - "</answer>"
  max_gen_toks: 4096
  do_sample: false
  temperature: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
num_fewshot: 0
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
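
The config points `process_results` at `utils.process_results`, but `utils.py` is not shown in this view. As a rough sketch only, assuming the scorer compares the last `\boxed{...}` in the completion with the last one in the dataset's `solution` field, such a function might look like the following. The helper names `last_boxed_content` and `normalize` are hypothetical, and the real implementation may normalize answers quite differently.

```python
def last_boxed_content(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never balanced


def normalize(answer):
    """Hypothetical normalization: drop surrounding and inner spaces."""
    return answer.strip().replace(" ", "")


def process_results(doc, results):
    """Sketch of a scorer returning the exact_match metric for one document.

    Assumes doc["solution"] holds a worked solution with a boxed answer.
    """
    completion = results[0]
    # Generation stops at "</answer>", so only the opening tag is expected.
    if "<answer>" in completion:
        completion = completion.split("<answer>")[-1]
    pred = last_boxed_content(completion)
    gold = last_boxed_content(doc["solution"])
    correct = (
        pred is not None
        and gold is not None
        and normalize(pred) == normalize(gold)
    )
    return {"exact_match": int(correct)}
```

Returning a dict keyed by `exact_match` matches the metric named in `metric_list`, so the `mean` aggregation would average these 0/1 scores over the test split. String comparison after light normalization is a deliberate simplification here; equivalent expressions written differently (for example `0.5` and `\frac{1}{2}`) would count as mismatches.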