How to evaluate LLMs automatically
December 20, 2023
Reference: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
TL;DR:
- GPT-4 scoring reaches the same level of agreement with human graders as humans do with each other, i.e., 85%
- The next best open-source model in the paper is Vicuna-13B, but I think Mistral-7B might be a strong contender
- Best method for:
  - automated evaluation: single answer grading on MT-bench (sketched below)
  - task evaluation: create a gold-labeled dataset and use single answer grading
  - relative comparison: calculate Elo ratings from blind pairwise voting (sketched below)
  - an open-source judge: fine-tune your model on the Chatbot Arena dataset
- LLM judges suffer from biases:
  - position bias: the order in which candidate answers appear in the prompt affects the verdict
  - verbosity bias: longer answers are favored even when the extra text just repeats information
  - self-enhancement bias: the judge favors answers generated by itself
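A minimal sketch of single answer grading with a GPT-4 judge, assuming the OpenAI Python client; the prompt wording and the `[[N]]` rating tag are my paraphrase, not the exact MT-bench judge prompt:

```python
# Single answer grading: ask a judge model for a 1-10 score and parse it out.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the quality of the assistant's answer to the "
    "user question below on a scale of 1 to 10. Give a short justification, then "
    'finish with "Rating: [[N]]".\n\n'
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
)

def grade_answer(question: str, answer: str, judge_model: str = "gpt-4") -> float | None:
    """Return the judge's 1-10 score for a single answer, or None if no score was found."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    text = response.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else None
```

For the relative-comparison case, a standard Elo update over blind pairwise votes looks like this; K=32, the 1000-point base, and the 400-point scale are my own defaults, and Chatbot Arena's exact computation may differ:

```python
# Elo ratings from pairwise battle outcomes.
from collections import defaultdict

def elo_ratings(battles, k=32, base=1000):
    """battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    ratings = defaultdict(lambda: float(base))
    for model_a, model_b, winner in battles:
        # Expected score of model_a against model_b under the current ratings.
        expected_a = 1 / (1 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)
```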
MT-bench
Contains 80 high-quality multi-turn questions across 8 common categories:
- Writing
- Roleplay
- Extraction
- Reasoning
- Math
- Coding
- Knowledge I (STEM)
- Knowledge II (humanities/social science)
Dataset:
https://huggingface.co/spaces/lmsys/mt-bench/tree/main/data/mt_bench
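A quick way to pull the questions from that space, assuming the `data/mt_bench/question.jsonl` layout and its `question_id`/`category`/`turns` fields (check the repo if these have moved):

```python
# Download and inspect the MT-bench questions from the Hugging Face space.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="lmsys/mt-bench",
    repo_type="space",
    filename="data/mt_bench/question.jsonl",
)

with open(path) as f:
    questions = [json.loads(line) for line in f]

print(len(questions))  # expect 80 questions
print(questions[0]["category"], questions[0]["turns"][0][:60])
```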
Related links
MT-bench results browser:
https://huggingface.co/spaces/lmsys/mt-bench
Chatbot Arena Leaderboard:
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
JudgeLM repo:
https://github.com/baaivision/JudgeLM