Hey friends,

I've been thinking and experimenting a lot with how to apply, evaluate, and operate LLM-evaluators and have gone down the rabbit hole on papers and results. Here's a writeup on what I've learned, as well as my intuition on it. It's a very long piece (49 min read) and so I'm only sending you the intro section. It'll be easier to read the full thing on my site. I appreciate you receiving this, but if you want to stop, simply unsubscribe.

Read in browser for best experience (web version has extras & images)

LLM-evaluators, also known as "LLM-as-a-Judge", are large language models (LLMs) that evaluate the quality of another LLM's response to an instruction or query. Their growing adoption is partly driven by necessity. LLMs can now solve increasingly complex and open-ended tasks such as long-form summarization, translation, and multi-turn dialogue. As a result, conventional evals that rely on n-grams, semantic similarity, or a gold reference have become less effective at distinguishing good responses from bad ones. And while we can rely on human evaluation or finetuned task-specific evaluators, both require significant effort and high-quality labeled data, making them difficult to scale. Thus, LLM-evaluators offer a promising alternative.

If you're considering using an LLM-evaluator, this is written for you. Drawing from two dozen papers, we'll discuss:
After reading this, you'll gain an intuition on how to apply, evaluate, and operate LLM-evaluators. We'll learn when to apply (i) direct scoring vs. pairwise comparisons, (ii) correlation vs. classification metrics, and (iii) LLM APIs vs. finetuned evaluator models.

Key considerations before adopting an LLM-evaluator

Before reviewing the literature on LLM-evaluators, let's first discuss a few questions which will help us interpret the findings as well as figure out how to use an LLM-evaluator.

First, what baseline are we comparing an LLM-evaluator against? For example, if we're prompting an LLM API, are we comparing it to human annotators or a smaller, finetuned evaluator model? It's easier to match the former than the latter on accuracy and speed. Most folks have human annotators as the baseline. Here, we aim for the LLM-human correlation to match human-human correlation. Compared to human annotators, LLM-evaluators can be orders of magnitude faster and cheaper, as well as more reliable. On the other hand, if your baseline is a finetuned classifier or reward model, then the goal is for the LLM-evaluator to achieve similar recall and precision as the finetuned classifier. This is a more challenging baseline. Furthermore, LLM-evaluators are unlikely to match the millisecond-level latency of a small finetuned evaluator, especially if the former requires Chain-of-Thought (CoT). LLM-evaluators likely also cost more per inference.

Second, how will we score responses via LLM-evaluators? There are at least three approaches that provide varying levels of accuracy, reliability, and flexibility.

Direct scoring evaluates a single response without needing an alternative for comparison. This makes it more versatile than pairwise comparison. Because it scores output directly, it's more suitable for objective assessments such as measuring faithfulness to a source text or detecting policy violations such as toxicity.

Pairwise comparison chooses the better of two responses or declares a tie. It's typically used (and more reliable) for subjective evals such as persuasiveness, tone, coherence, etc. Studies show that pairwise comparisons lead to more stable results and smaller differences between LLM judgments and human annotations relative to direct scoring.

Reference-based evaluation involves comparing the response being evaluated to a gold reference. The reference contains the information that should be included in the generated response. The LLM-evaluator evaluates how closely the generated response matches the reference, essentially doing a more sophisticated form of fuzzy-matching.

These three approaches are not interchangeable. Some evaluation tasks, such as assessing faithfulness or instruction-following, don't fit the pairwise comparison paradigm. For example, a response is either faithful to the provided context or it is not; evaluating a response as more faithful than the alternative doesn't address the eval criteria. Similarly, reference-based evaluations require annotated references, while direct scoring and pairwise comparisons do not.
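To make the first two approaches concrete, here's a rough sketch of what direct scoring and pairwise comparison prompts could look like. Everything below is illustrative: `call_llm` is a hypothetical stand-in for whatever LLM API you use, and the prompt wording and parsing are deliberately simplified rather than recommendations from the papers discussed later.

```python
# Illustrative sketch: direct scoring vs. pairwise comparison with an LLM-evaluator.
# `call_llm` is a hypothetical placeholder for your LLM API client of choice.

DIRECT_SCORING_PROMPT = """You are evaluating a summary for faithfulness.
Context:
{context}

Summary:
{response}

Is every claim in the summary supported by the context? Answer with a single
word: "faithful" or "unfaithful"."""

PAIRWISE_PROMPT = """You are comparing two responses to the same instruction.
Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Which response is more coherent and persuasive? Answer "A", "B", or "tie"."""


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM API call (e.g., a chat completion endpoint)."""
    raise NotImplementedError


def direct_score(context: str, response: str) -> int:
    """Return 1 if the evaluator judges the response faithful, else 0."""
    answer = call_llm(DIRECT_SCORING_PROMPT.format(context=context, response=response))
    # Naive parsing; check "unfaithful" first since "faithful" is a substring of it.
    return 0 if "unfaithful" in answer.lower() else 1


def pairwise_choice(instruction: str, response_a: str, response_b: str) -> str:
    """Return "A", "B", or "tie" based on the evaluator's stated preference."""
    answer = call_llm(
        PAIRWISE_PROMPT.format(
            instruction=instruction, response_a=response_a, response_b=response_b
        )
    ).strip().lower()
    # Naive parsing; assumes the model follows the requested answer format.
    if "tie" in answer:
        return "tie"
    return "A" if answer.startswith("a") else "B"
```

Note that the direct-scoring prompt asks for a binary label rather than a 1-5 rating; this ties into the choice of metrics below.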
Finally, what metrics will we use to evaluate LLM-evaluators? Classification and correlation metrics are typically adopted in the literature and industry.

Classification metrics are more straightforward to apply and interpret. For example, we can evaluate the recall and precision of an LLM-evaluator at the task of detecting factual inconsistency or toxicity in responses. Or we could assess the LLM-evaluator's ability to pick the more preferred response via pairwise comparison. Either way, we can frame it as a binary task and rely on good ol' classification metrics.
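As a rough illustration (assuming scikit-learn and a handful of toy labels), scoring the evaluator against human annotations might look like this:

```python
# Minimal sketch: treating the LLM-evaluator as a binary classifier and scoring it
# against human gold labels with standard classification metrics (assumes scikit-learn).
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = factually inconsistent (the "positive" class we want to catch), 0 = consistent.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]      # toy gold labels from annotators
evaluator_labels = [1, 0, 0, 1, 0, 1, 1, 0]  # toy labels from the LLM-evaluator

precision = precision_score(human_labels, evaluator_labels)  # of flagged, how many truly bad
recall = recall_score(human_labels, evaluator_labels)        # of truly bad, how many flagged
f1 = f1_score(human_labels, evaluator_labels)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```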
Diagnostic plots for classification tasks (source)

Correlation metrics are trickier to interpret. Some commonly used correlation metrics include Cohen's κ (kappa), Kendall's τ (tau), and Spearman's ρ (rho).

Cohen's κ measures the agreement between two raters on categorical data, taking into account the probability of agreement occurring due to chance. It ranges from -1 to 1, with 0 indicating no agreement beyond chance and 1 indicating perfect agreement. It is generally more conservative compared to other correlation metrics. Values of 0.21 - 0.40 can be interpreted as fair agreement while 0.41 - 0.60 suggest moderate agreement.

Kendall's τ and Spearman's ρ measure the strength and direction of the association between two rankings. Both range from -1 to 1, where -1 indicates perfect negative correlation, 1 indicates perfect positive correlation, and 0 suggests no correlation. Kendall's τ is more robust to outliers due to its focus on the relative ordering of pairs, while Spearman's ρ is more sensitive to the magnitude of differences between ranks. They typically have higher values compared to Cohen's κ since they don't adjust for chance agreement.

When choosing a metric, consider the type of data you're working with. Cohen's κ is more suitable for binary or categorical data when you want to assess the agreement between raters while adjusting for chance agreement. However, it may over-penalize ordinal data, such as a Likert scale. If your data is ordinal, consider Kendall's τ or Spearman's ρ instead.

I tend to be skeptical of correlation metrics. Most don't account for chance agreement and thus could be overoptimistic (Cohen's κ is the exception). Furthermore, compared to classification metrics, it's less straightforward to translate correlation metrics to performance in production. (What's the evaluator's recall on bad responses? What about false positive rate?) Thus, where possible, I have my evaluators return binary outputs. This improves model performance while making it easier to apply classification metrics.

Continue reading here.
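And if you do report correlation metrics, here's a minimal sketch (assuming scikit-learn, SciPy, and toy 1-5 ratings) of computing Cohen's κ, Kendall's τ, and Spearman's ρ between an LLM-evaluator and a human rater:

```python
# Minimal sketch: agreement and rank correlation between an LLM-evaluator and a
# human rater on toy 1-5 Likert-style ratings (assumes scikit-learn and SciPy).
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import cohen_kappa_score

human_ratings = [5, 4, 4, 2, 1, 3, 5, 2]      # toy human ratings
evaluator_ratings = [4, 4, 5, 2, 2, 3, 5, 1]  # toy LLM-evaluator ratings

# Cohen's kappa treats ratings as categories and adjusts for chance agreement.
kappa = cohen_kappa_score(human_ratings, evaluator_ratings)

# Kendall's tau and Spearman's rho measure rank correlation; no chance adjustment.
tau, _ = kendalltau(human_ratings, evaluator_ratings)
rho, _ = spearmanr(human_ratings, evaluator_ratings)

print(f"kappa={kappa:.2f} tau={tau:.2f} rho={rho:.2f}")
```

Since 1-5 ratings are ordinal, the rank-based τ and ρ are usually the more natural fit here, per the discussion above.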