How to evaluate LLMs automatically
December 20, 2023
Reference: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
TL;DR:
- GPT-4 scoring reaches the same level of agreement with human graders as humans do with each other, i.e., 85%
- The next best open-source model in the paper is Vicuna-13B, but I think Mistral-7B might be a strong contender
- Best method for:
  - automated evaluation: single answer grading on MT-bench (sketched below)
  - task evaluation: create a gold-labeled dataset and use single answer grading
  - relative comparison: calculate Elo ratings from blind pairwise voting (sketched below)
  - an open-source judge: fine-tune your model on the Chatbot Arena dataset
- LLM judges suffer from biases:
  - position bias: the order in which candidate answers appear in the prompt affects the verdict
  - verbosity bias: longer answers are favored even when the extra text just repeats information
  - self-enhancement bias: the judge favors answers generated by itself
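A minimal sketch of single answer grading with a GPT-4 judge, assuming the OpenAI Python client; the prompt wording and the `[[N]]` rating tag are my paraphrase, not the exact MT-bench judge prompt:

```python
# Single answer grading: ask a judge model for a 1-10 score and parse it out.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the quality of the assistant's answer to the "
    "user question below on a scale of 1 to 10. Give a short justification, then "
    'finish with "Rating: [[N]]".\n\n'
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
)

def grade_answer(question: str, answer: str, judge_model: str = "gpt-4") -> float | None:
    """Return the judge's 1-10 score for a single answer, or None if no score was found."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    text = response.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else None
```

For the relative-comparison case, a standard Elo update over blind pairwise votes looks like this; K=32, the 1000-point base, and the 400-point scale are my own defaults, and Chatbot Arena's exact computation may differ:

```python
# Elo ratings from pairwise battle outcomes.
from collections import defaultdict

def elo_ratings(battles, k=32, base=1000):
    """battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: float(base))
    for model_a, model_b, winner in battles:
        # Expected score of model_a against model_b under the current ratings.
        expected_a = 1 / (1 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)
```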
MT-bench
Contains 80 high-quality multi-turn questions across 8 common categories:
- Writing
- Roleplay
- Extraction
- Reasoning
- Math
- Coding
- Knowledge I (STEM)
- Knowledge II (humanities/social science)
Dataset:
https://huggingface.co/spaces/lmsys/mt-bench/tree/main/data/mt_bench
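A quick way to pull the questions from that space, assuming the `data/mt_bench/question.jsonl` layout and its `question_id`/`category`/`turns` fields (check the repo if these have moved):

```python
# Download and inspect the MT-bench questions from the Hugging Face space.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="lmsys/mt-bench",
    repo_type="space",
    filename="data/mt_bench/question.jsonl",
)

with open(path) as f:
    questions = [json.loads(line) for line in f]

print(len(questions))  # expect 80 questions
print(questions[0]["category"], questions[0]["turns"][0][:60])
```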
Related links
MT-bench results browser:
https://huggingface.co/spaces/lmsys/mt-bench
Chatbot Arena Leaderboard:
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
JudgeLM repo:
https://github.com/baaivision/JudgeLM