How Does Evaluation Work?

Grading the performance of your LLM application

The purpose of evaluation

Developers using LLMs build applications to respond to user queries, transform or generate content, and classify and structure data.

It’s extremely easy to start building an AI application with LLMs because developers no longer have to collect labeled data or train a model; they only need to write a prompt asking the model for what they want. However, this ease comes with tradeoffs. LLMs are generalized models that aren’t fine-tuned for a specific task. With a standard prompt, these applications demo well, but in production they often fail in more complex scenarios.

You need a way to judge the quality of your LLM outputs. For example, you might grade a chatbot’s responses on relevance, hallucination percentage, and latency.

When you adjust your prompts or retrieval strategy, evaluation tells you whether your application has improved and by how much. The dataset you evaluate against determines how well your evaluation metrics generalize to production use. A limited dataset can score highly on evaluation metrics yet perform poorly in real-world scenarios.
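
As a rough illustration, the sketch below aggregates per-example labels into a single score so you can compare two versions of your application on the same dataset. The dataset shape and the `evaluator` callable here are assumptions for the example, not a specific library API.

```python
from typing import Callable

def relevance_rate(
    dataset: list[dict],                   # each item: {"query": ..., "response": ...}
    evaluator: Callable[[str, str], str],  # labels a (query, response) pair "relevant"/"irrelevant"
) -> float:
    """Fraction of responses the evaluator labels 'relevant'."""
    labels = [evaluator(item["query"], item["response"]) for item in dataset]
    return sum(label == "relevant" for label in labels) / len(labels)

# Run the same dataset through prompt v1 and prompt v2 of your application,
# then compare the two scores; the comparison only generalizes to production
# if the dataset resembles production traffic.
```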

Creating an LLM evaluation

There are four components to an LLM evaluation (see the sketch after this list):

  1. The input data: the input, output, and prompt variables from your LLM application, depending on what you are trying to critique or evaluate.

  2. The eval prompt template: this is where you specify your criteria, input data, and output labels to judge the quality of the LLM output.

  3. The output: the LLM evaluator generates eval labels and explanations that show why it assigned a certain label or score.

  4. The aggregate metric: when you run thousands of evaluations across a large dataset, aggregate metrics summarize the quality of your responses over time and across different prompts, retrieval strategies, and LLMs.
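
As a sketch of the first three components, a single relevance judge might look like the following. The template wording and the `llm_complete` callable are assumptions for illustration; substitute your own criteria and LLM client.

```python
from typing import Callable

# 2. Eval prompt template: criteria, slots for the input data, and the output labels.
EVAL_PROMPT_TEMPLATE = """You are judging whether a response is relevant to the question asked.

[Question]: {query}
[Response]: {response}

Answer with a single word, "relevant" or "irrelevant",
then give a short explanation on the next line."""

def evaluate_relevance(
    query: str,
    response: str,
    llm_complete: Callable[[str], str],  # hypothetical LLM client call
) -> dict:
    """Run one evaluation and return the evaluator's label and explanation."""
    # 1. Input data: the query and the application's response fill the template.
    prompt = EVAL_PROMPT_TEMPLATE.format(query=query, response=response)

    # 3. Output: the evaluator LLM returns a label plus an explanation.
    completion = llm_complete(prompt)
    label, _, explanation = completion.partition("\n")
    return {"label": label.strip().lower(), "explanation": explanation.strip()}
```

The returned label is what the fourth component, the aggregate metric, would count across your whole dataset.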

LLM evaluation is extremely flexible, because you can specify the rules and criteria mostly in plain language, much as you would instruct human evaluators to grade your responses.

There are several other ways of evaluating your application, including human-labeled feedback and code-based heuristics. Depending on what you're trying to evaluate, these approaches may be better suited to your application.
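
For example, a code-based heuristic can check structural properties deterministically, with no LLM judge involved. The criteria below (valid JSON and a length limit) are assumptions chosen for illustration:

```python
import json

def heuristic_checks(output: str, max_chars: int = 2000) -> dict:
    """Deterministic, code-based checks on a single LLM output."""
    try:
        json.loads(output)
        is_valid_json = True
    except json.JSONDecodeError:
        is_valid_json = False

    return {
        "is_valid_json": is_valid_json,                   # does the output parse as JSON?
        "within_length_limit": len(output) <= max_chars,  # is it short enough?
    }
```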

Read our research on task-based evaluation and our best practices guide to learn more!
