BLEUBERI/eval/WildBench/evaluation/eval_template.pairwise.v2.md
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models.
We will provide you with the user query and a pair of AI-generated responses (Response A and Response B).
You should first read the user query and the conversation history carefully to understand the task, and then evaluate the quality of the responses based on the checklist and rules provided below.
# Conversation between User and AI
## History
<|begin_of_history|>
{$history}
<|end_of_history|>
## Current User Query
<|begin_of_query|>
{$user_query}
<|end_of_query|>
## Response A
<|begin_of_response_A|>
{$candidate_A}
<|end_of_response_A|>
## Response B
<|begin_of_response_B|>
{$candidate_B}
<|end_of_response_B|>
# Evaluation
## Checklist
<|begin_of_checklist|>
{$checklist}
<|end_of_checklist|>
Please use this checklist to guide your evaluation, but do not limit your assessment to the checklist.
## Rules
You should compare the two responses based on your analysis of the user query and the conversation history.
You should first write down your analysis and the checklist items you used for the evaluation, and then provide your assessment according to the checklist.
There are five choices to give your final assessment: ["A++", "A+", "A=B", "B+", "B++"], which correspond to the following meanings:
- `A++`: Response A is much better than Response B.
- `A+`: Response A is only slightly better than Response B.
- `A=B`: Response A and B are of the same quality. Please use this choice sparingly.
- `B+`: Response B is only slightly better than Response A.
- `B++`: Response B is much better than Response A.
## Output Format
First, output your analysis of each model response; then summarize your assessment into three aspects: "reason of A=B", "reason of A>B", and "reason of B>A"; and finally make your choice for the final assessment.
Please provide your evaluation results in the following JSON format, filling in the placeholders in []:
```json
{
  "analysis of A": "[analysis of Response A]",
  "analysis of B": "[analysis of Response B]",
  "reason of A=B": "[where Response A and B perform equally well]",
  "reason of A>B": "[where Response A is better than Response B]",
  "reason of B>A": "[where Response B is better than Response A]",
  "choice": "[A++ or A+ or A=B or B+ or B++]"
}
```
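A harness that consumes this template needs to parse and validate the evaluator's JSON verdict. The sketch below is a minimal, hypothetical validator, not part of the WildBench codebase; the `raw_output` string and `validate_verdict` name are assumptions made for illustration. It checks that all six keys are present and that `choice` is one of the five allowed values.

```python
import json

# Hypothetical example of an evaluator's raw output following the template above.
raw_output = """
{
  "analysis of A": "Response A answers the query directly and concisely.",
  "analysis of B": "Response B is thorough but includes irrelevant detail.",
  "reason of A=B": "Both responses are factually correct.",
  "reason of A>B": "Response A is more concise and better organized.",
  "reason of B>A": "Response B covers an edge case that A misses.",
  "choice": "A+"
}
"""

REQUIRED_KEYS = {
    "analysis of A",
    "analysis of B",
    "reason of A=B",
    "reason of A>B",
    "reason of B>A",
    "choice",
}
VALID_CHOICES = {"A++", "A+", "A=B", "B+", "B++"}


def validate_verdict(text: str) -> dict:
    """Parse an evaluator response and check it against the expected schema."""
    verdict = json.loads(text)
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if verdict["choice"] not in VALID_CHOICES:
        raise ValueError(f"invalid choice: {verdict['choice']!r}")
    return verdict


verdict = validate_verdict(raw_output)
print(verdict["choice"])  # A+
```

In practice, evaluator models sometimes wrap the JSON in a code fence or add surrounding prose, so a real harness would typically extract the first `{...}` span before calling `json.loads`.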