
MT-Bench: LLM-as-a-judge benchmark


With the rapid growth of research on large language models (LLMs), we now have a diverse array of models capable of performing various tasks. Current benchmarks for LLMs mostly evaluate models on close-ended questions with short responses, which makes it extremely hard to accurately reflect how well LLMs perform on real-world open-ended questions.

The following are the major benchmark categories generally used to evaluate LLMs:

  • Core-knowledge benchmarks: require LLMs to generate short answers to a predetermined set of questions that are easy to evaluate. HumanEval, MMLU, GSM-8K, and AGIEval are some examples of core-knowledge benchmarks.

  • Instruction-following benchmarks: evaluate whether LLMs are capable of following instructions properly. For example: Flan, Self-Instruct, NaturalInstructions, Super-NaturalInstructions.

  • Conversational benchmarks: evaluate an LLM’s capability to handle conversations. CoQA, MMDialog, and OpenAssistant are some example benchmarks.

These benchmarks largely overlook the open-ended nature of human-AI interactions, even though human preference is one of the major factors in an LLM’s utility. To address this gap, the concept of LLM-as-a-judge was introduced: the human rater is replaced with a state-of-the-art LLM such as GPT-4, since such models are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) and therefore align well with human preferences.

MT-Bench

MT-Bench is a benchmark designed to test the multi-turn conversation and instruction-following capabilities of chatbots across 8 common categories: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science). MT-Bench contains 10 open-ended multi-turn questions for each category.
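The structure above can be sketched in a few lines of Python. The class and field names here are illustrative, not the actual MT-Bench data format; the only facts taken from the benchmark are the 8 category names and the two-turn shape of each question.

```python
from dataclasses import dataclass

# The 8 MT-Bench categories (knowledge I = STEM, knowledge II = humanities).
CATEGORIES = [
    "writing", "roleplay", "extraction", "reasoning",
    "math", "coding", "stem", "humanities",
]

@dataclass
class MTBenchQuestion:
    """Hypothetical model of one MT-Bench question: a category plus two user turns."""
    category: str
    turns: list

    def __post_init__(self):
        assert self.category in CATEGORIES, f"unknown category: {self.category}"
        assert len(self.turns) == 2, "MT-Bench questions are two-turn"

# Example: a made-up math question in the MT-Bench two-turn style.
q = MTBenchQuestion(
    category="math",
    turns=["What is the derivative of x^2?",
           "Now integrate your previous answer."],
)
```

With 8 categories and 10 questions each, the full benchmark would hold 80 such two-turn questions.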

The MT-Bench benchmark was introduced in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. Alongside it, the authors introduced Chatbot Arena, a platform where humans vote on the outputs of anonymized LLMs. These votes help measure human preference for LLM-generated responses.

Types of LLM-as-a-judge

  • Pairwise comparison: An LLM judge is presented with a question and two answers and tasked with determining which one is better, or whether it is a tie.

  • Single-answer grading: An LLM judge is asked to directly assign a score to a single answer.

  • Reference-guided grading: In some cases, a reference solution is provided to the judge to enable a better judgement.
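The pairwise setup above boils down to a prompt that shows the judge the question and both answers and constrains its verdict format. The sketch below builds such a prompt; the wording is an assumption in the spirit of the MT-Bench templates, not the exact template text.

```python
def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Build an illustrative pairwise-comparison judge prompt.

    The judge is asked to reply with a fixed token ([[A]], [[B]], or [[C]]
    for a tie) so the verdict can be parsed reliably from its output.
    """
    return (
        "You are an impartial judge. Compare the two AI assistant answers "
        "to the user question below. Reply with '[[A]]' if answer A is "
        "better, '[[B]]' if answer B is better, or '[[C]]' for a tie.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}\n"
    )

prompt = build_pairwise_prompt(
    "What is 2+2?",
    "2+2 equals 4.",
    "The answer is 5.",
)
```

The constrained verdict tokens matter in practice: parsing free-form judge text is brittle, while a fixed marker can be extracted with a simple string match.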

Advantages of LLM-as-a-judge

  • Scalability: LLM-as-a-judge reduces human involvement to a minimum, so LLMs can be evaluated in faster iterations.

  • Explainability: An LLM judge can give a detailed explanation for each assigned score, which plain human scoring typically does not provide.

Limitations of LLM-as-a-Judge

Position bias: LLMs sometimes exhibit an inclination to favor answers in certain positions over others. This is detected by swapping the order of the two responses: if the judge LLM switches its verdict depending on which response is placed first, that indicates position bias.

Verbosity bias: an LLM judge favors longer, more verbose responses over shorter ones, even when they are less clear or less accurate.

Self-enhancement bias: a judge LLM favors answers generated by itself.

Limited capability to grade math and reasoning: LLMs are known to have limited math and reasoning abilities, and judge LLMs sometimes fail to grade correctly even math problems they are able to solve themselves.

Addressing limitations

Swapping positions: Position bias is mitigated by running the same prompt twice with the two answers in swapped positions. If the judge LLM’s verdict switches between the two runs, the comparison is marked as a tie.
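This swap-and-compare rule is mechanical enough to sketch directly. In the snippet below, `judge` stands in for any pairwise LLM judge that returns "A" or "B" for whichever slot it prefers; the function and its names are illustrative, not an actual library API.

```python
def judge_with_position_swap(judge, question, answer_1, answer_2):
    """Run a pairwise judge twice with the answers swapped.

    `judge(question, slot_a, slot_b)` is assumed to return "A" or "B".
    If the two runs disagree about which underlying answer wins, the
    comparison is declared a tie, which mitigates position bias.
    """
    first = judge(question, answer_1, answer_2)   # answer_1 in slot A
    second = judge(question, answer_2, answer_1)  # answer_1 in slot B
    # Map each slot-level verdict back to the underlying answer.
    win_first = "1" if first == "A" else "2"
    win_second = "1" if second == "B" else "2"
    return win_first if win_first == win_second else "tie"

# A maximally position-biased toy judge always prefers slot A,
# so the two runs disagree and the result collapses to a tie.
always_slot_a = lambda q, a, b: "A"
verdict = judge_with_position_swap(always_slot_a, "q", "x", "y")  # → "tie"
```

A consistent judge, by contrast, keeps its winner under the swap and the function returns that answer's index.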

Few-shot judge: Using few-shot examples in the judge prompt decreases position bias significantly. With few-shot samples, the consistency of GPT-4 increased from 65.0% to 77.5%.

Chain-of-thought and reference-guided judge: To improve math and reasoning scoring, chain-of-thought prompting and a reference-guided judge were introduced. This reduced the error rate significantly, from 70% to 15%.
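The reference-guided idea can be sketched as a prompt builder: the judge first sees an independently obtained correct solution, then grades the assistant's answer step by step. As with the earlier sketch, this wording is an assumption in the spirit of the approach, not the exact MT-Bench template.

```python
def build_reference_guided_prompt(question: str,
                                  reference_answer: str,
                                  answer: str) -> str:
    """Illustrative reference-guided, chain-of-thought grading prompt.

    Showing the judge a trusted reference solution, and asking it to
    check the answer step by step before scoring, targets the weakness
    LLM judges show on math and reasoning questions.
    """
    return (
        "You are grading a math/reasoning answer. First read the "
        "reference solution, then verify the assistant's answer step by "
        "step, and finally give a score from 1 to 10 as '[[score]]'.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Reference solution]\n{reference_answer}\n\n"
        f"[Assistant's answer]\n{answer}\n"
    )

grading_prompt = build_reference_guided_prompt(
    "Solve x + 1 = 2.",
    "Subtract 1 from both sides: x = 1.",
    "x = 1",
)
```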

Fine-tuning a judge: Fine-tuning Vicuna-13B on Chatbot Arena data increased its evaluation consistency to 65%.
