# Instruction

You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models.

We will provide you with the user query and a pair of AI-generated responses (Response A and Response B).

You should first read the user query and the conversation history carefully to analyze the task, and then evaluate the quality of the responses based on the rules provided below.
# Conversation between User and AI

## History

<|begin_of_history|>

{$history}

<|end_of_history|>
## Current User Query

<|begin_of_query|>

{$user_query}

<|end_of_query|>
## Response A

<|begin_of_response_A|>

{$candidate_A}

<|end_of_response_A|>
## Response B

<|begin_of_response_B|>

{$candidate_B}

<|end_of_response_B|>
# Evaluation

## Checklist

<|begin_of_checklist|>

{$checklist}

<|end_of_checklist|>

Please use this checklist to guide your evaluation, but do not limit your assessment to the checklist.
## Rules

You should compare the two responses based on your analysis of the user query and the conversation history.

You should first write down your analysis and the checklist that you used for the evaluation, and then provide your assessment according to the checklist.

There are five choices for your final assessment: ["A++", "A+", "A=B", "B+", "B++"], which correspond to the following meanings:

- `A++`: Response A is much better than Response B.
- `A+`: Response A is only slightly better than Response B.
- `A=B`: Response A and B are of the same quality. Please use this choice sparingly.
- `B+`: Response B is only slightly better than Response A.
- `B++`: Response B is much better than Response A.
## Output Format

First, output your analysis of each model response, then summarize your assessment into three aspects: "reason of A=B", "reason of A>B", and "reason of B>A", and finally make your choice for the final assessment.

Please provide your evaluation results in the following JSON format by filling in the placeholders in []:

```
{
  "analysis of A": "[analysis of Response A]",
  "analysis of B": "[analysis of Response B]",
  "reason of A=B": "[where Response A and B perform equally well]",
  "reason of A>B": "[where Response A is better than Response B]",
  "reason of B>A": "[where Response B is better than Response A]",
  "choice": "[A++ or A+ or A=B or B+ or B++]"
}
```
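
---

For reference, a minimal sketch of how a Python harness might fill the `{$...}` placeholders above and parse the judge's JSON verdict. The function names, the regex extraction, and the validation logic are illustrative assumptions, not part of the BLEUBERI codebase.

```python
import json
import re

# The five allowed verdicts listed in the "Rules" section above.
CHOICES = {"A++", "A+", "A=B", "B+", "B++"}


def render_prompt(template: str, **fields: str) -> str:
    """Fill the {$name} placeholders used by this template.

    Expected fields: history, user_query, candidate_A, candidate_B, checklist.
    """
    for name, value in fields.items():
        template = template.replace("{$" + name + "}", value)
    return template


def parse_verdict(judge_output: str) -> dict:
    """Extract the JSON object from a judge response and validate its choice."""
    match = re.search(r"\{.*\}", judge_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    result = json.loads(match.group(0))
    if result.get("choice") not in CHOICES:
        raise ValueError(f"unexpected choice: {result.get('choice')!r}")
    return result
```

With the template string loaded from this file, `render_prompt(template, history=..., user_query=..., candidate_A=..., candidate_B=..., checklist=...)` yields the judge prompt, and `parse_verdict` can then be applied to the model's reply.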