Download dataset
We have pre-generated several category classifier benchmarks and their ground truths. With git-lfs installed, you can download them to the classify/ directory by running:
> git clone https://huggingface.co/datasets/lmarena-ai/categories-benchmark-eval
// cd into classify/ and then copy the label_bench directory to the current directory
> cp -r categories-benchmark-eval/label_bench .
Your label_bench directory should follow the structure:
├── label_bench/
│ ├── creative_writing_bench/
│ │ ├── data/
│ │ │ └── llama-v3p1-70b-instruct.json
│ │ └── test.json
│ ├── ...
│ ├── your_bench_name/
│ │ ├── data/
│ │ │ ├── your_classifier_data_1.json
│ │ │ ├── your_classifier_data_2.json
│ │ │ └── ...
│ │ └── test.json (your ground truth)
└── ...
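As a quick sanity check after copying, the layout above can be verified with a short script. This is only a sketch: the helper name check_bench_layout is hypothetical and not part of this repository, and it assumes nothing beyond the structure shown above (a data/ directory with at least one JSON file, plus a test.json per bench).

```python
from pathlib import Path


def check_bench_layout(root):
    """Return {bench_name: [problems]} for bench directories that look incomplete.

    Hypothetical helper, not part of the repository. It only checks the
    directory layout described in this README, not the file contents.
    """
    problems = {}
    for bench in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        issues = []
        # Each bench needs a ground-truth file named test.json.
        if not (bench / "test.json").is_file():
            issues.append("missing test.json")
        # Classifier outputs live in data/, one JSON file per classifier.
        data_dir = bench / "data"
        if not data_dir.is_dir():
            issues.append("missing data/ directory")
        elif not any(data_dir.glob("*.json")):
            issues.append("no classifier JSON files in data/")
        if issues:
            problems[bench.name] = issues
    return problems


if __name__ == "__main__":
    root = Path("label_bench")
    if root.is_dir():
        for name, issues in check_bench_layout(root).items():
            print(f"{name}: {', '.join(issues)}")
```

Benches that pass both checks are omitted from the result, so an empty dictionary means every bench directory matches the expected structure.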
How to evaluate your category classifier?
To test a classifier for a new category, first make sure you have created the category's child class in category.py. Then, to generate classification labels, make the necessary edits in config.yaml and run:
python label.py --config config.yaml --testing
If you are labeling a vision category, add the --vision flag to the command. This will add a new column to the input data called image_path that contains the path to the image corresponding to each conversation. Ensure that you update your config with the correct image_dir where the images are stored.
Then, add your new category bench to tag_names in display_score.py. Once you also have a correctly formatted ground truth JSON file (test.json), you can report the performance of your classifier by running:
python display_score.py --bench <your_bench>
If you want to check out conflicts between your classifier and ground truth, use
python display_score.py --bench <your_bench> --display-conflict
Example output:
> python display_score.py --bench if_bench --display-conflict
Model: gpt-4o-mini-2024-07-18
Accuracy: 0.967
Precision: 0.684
Recall: 0.918
###### CONFLICT ######
Ground Truth = True; Pred = False
####################
...
Ground Truth = False; Pred = True
####################
...
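For reference, the three scores and the two conflict groups in the example output follow the standard binary-classification definitions. The sketch below illustrates them on boolean labels. It is not the code in display_score.py: the function names score and conflicts are hypothetical, and it assumes the ground truth and predictions are already aligned lists of booleans.

```python
def score(truth, pred):
    """Compute accuracy, precision, and recall for aligned boolean labels.

    Hypothetical helper for illustration; display_score.py may differ.
    """
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def conflicts(truth, pred):
    """Group the indices where classifier and ground truth disagree."""
    missed = [i for i, (t, p) in enumerate(zip(truth, pred)) if t and not p]
    spurious = [i for i, (t, p) in enumerate(zip(truth, pred)) if not t and p]
    return {
        "Ground Truth = True; Pred = False": missed,
        "Ground Truth = False; Pred = True": spurious,
    }


if __name__ == "__main__":
    truth = [True, True, False, False, True]
    pred = [True, False, True, False, True]
    for name, value in score(truth, pred).items():
        print(f"{name.capitalize()}: {value:.3f}")
    for label, indices in conflicts(truth, pred).items():
        print(label, indices)
```

A high recall with a noticeably lower precision, as in the example output, means the classifier catches most true members of the category but also flags a fair number of conversations that the ground truth marks as False. The second conflict group is the place to look for those.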