BLEUBERI

mirror of https://github.com/lilakk/BLEUBERI.git synced 2026-04-19 12:58:12 +00:00

History

yapei 70a66e4a16 initial commit		2025-06-04 20:36:43 +00:00
..
category.py	initial commit	2025-06-04 20:36:43 +00:00
config.yaml	initial commit	2025-06-04 20:36:43 +00:00
display_score.py	initial commit	2025-06-04 20:36:43 +00:00
label.py	initial commit	2025-06-04 20:36:43 +00:00
README.md	initial commit	2025-06-04 20:36:43 +00:00
vision_config.yaml	initial commit	2025-06-04 20:36:43 +00:00

README.md

Download dataset

We have pre-generated several category classifier benchmarks and ground truths. You can download them (with git-lfs installed) to the directory classify/ by running

> git clone https://huggingface.co/datasets/lmarena-ai/categories-benchmark-eval
// cd into classify/ and then copy the label_bench directory to the current directory
> cp -r categories-benchmark-eval/label_bench .

Your label_bench directory should follow the structure:

├── label_bench/
│   ├── creative_writing_bench/
│   │   ├── data/
│   │   │    └── llama-v3p1-70b-instruct.json
│   │   └── test.json
│   ├── ...
│   ├── your_bench_name/
│   │   ├── data/
│   │   │    ├── your_classifier_data_1.json
│   │   │    ├── your_classifier_data_2.json
│   │   │    └── ...
│   │   └── test.json (your ground truth)
└── ...

How to evaluate your category classifier?

To test your new classifier for a new category, you would have to make sure you created the category child class in category.py. Then, to generate classification labels, make the necessary edits in config.yaml and run

python label.py --config config.yaml --testing

If you are labeling a vision category, add the --vision flag to the command. This will add a new column to the input data called image_path that contains the path to the image corresponding to each conversation. Ensure that you update your config with the correct image_dir where the images are stored.

Then, add your new category bench to tag_names in display_score.py. After making sure that you also have a correctly formatted ground truth json file, you can report the performance of your classifier by running

python display_score.py --bench <your_bench>

If you want to check out conflicts between your classifier and ground truth, use

python display_score.py --bench <your_bench> --display-conflict

Example output:

> python display_score.py --bench if_bench --display-conflict
Model: gpt-4o-mini-2024-07-18
Accuracy: 0.967
Precision: 0.684
Recall: 0.918

###### CONFLICT ######

Ground Truth = True; Pred = False
\####################
...

Ground Truth = False; Pred = True
\####################
...