SARTHAK LANGDE

How to evaluate LLMs automatically?

December 20, 2023

Reference: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

TL;DR:


MT-bench

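Contains 80 high-quality multi-turn questions across 8 common categories: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).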

Dataset:
https://huggingface.co/spaces/lmsys/mt-bench/tree/main/data/mt_bench
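To get a feel for the dataset layout, here is a minimal sketch that parses question records and counts them per category. It assumes the questions are stored as JSON Lines with `question_id`, `category`, and `turns` fields. That layout is an assumption to check against the files at the link above. The records in `SAMPLE` are hypothetical placeholders, not real MT-bench questions.

```python
import json
from collections import Counter


def parse_questions(jsonl_text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def count_by_category(questions):
    """Count how many questions fall into each category."""
    return Counter(q["category"] for q in questions)


# Hypothetical sample records in the assumed format (not real MT-bench data).
SAMPLE = "\n".join([
    json.dumps({"question_id": 1, "category": "writing",
                "turns": ["Draft a short note.", "Now make it formal."]}),
    json.dumps({"question_id": 2, "category": "writing",
                "turns": ["Write a haiku.", "Explain its imagery."]}),
    json.dumps({"question_id": 3, "category": "math",
                "turns": ["Compute 2 + 3.", "Now double the result."]}),
])

questions = parse_questions(SAMPLE)
counts = count_by_category(questions)

for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

To run this on the real data, read the downloaded question file into a string and pass it to `parse_questions`. If the file matches the description above, the counter should contain 8 categories totalling 80 questions.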


Related links

MT-bench results browser:
https://huggingface.co/spaces/lmsys/mt-bench

Chatbot Arena Leaderboard:
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

JudgeLM repo:
https://github.com/baaivision/JudgeLM