You need evals - a primer and new techniques

#ai #llm #draft *This was written in September 2024. If you are reading this substantially after that date, the content may be less relevant to the current state of the industry* This post is a broad introduction to why evals, common types of evals, and ### Outline - Why evals - Value you get - How to implement (high level) - Programmatic evals - LLM evals - Meta-evals - Current challenges - Research directions > [!NOTE] In short > - Evals are critical to fast development and to deploying an LLM-enabled production application. However, it's tricky to do this well. > - You need to create traces, create datasets, decide your criteria, implement programmatic and LLM evaluators, and use your evals actively during development. > - You should use programmatic evals where possible (e.g. string comparison, keyword checking, NLP sentiment analysis). > - LLM as a judge is straightforward but requires work to align to human feedback. This is good for qualitative criteria like verbosity, but requires care for things like hallucination checking and accuracy. > - The development cycle of identifying problems, modifying the prompt, running your evals, and adding to the dataset is key for rapid development. > - The most pressing research challenges in evals revolve around their complexity of implementation, their alignment with human evaluation, their need to change over time, and the lack of solid tools for building them. ## Why evals? Do you need better evaluation systems on your gen AI app? You and everyone else. Measuring the performance of AI systems is the biggest blocker for putting them into production. **By implementing a strong evaluations system, you ultimately save time and get a better product out the door, faster.** However, writing good evaluations isn't easy. Human evals are expensive, programmatic evals are limited, and LLM evals are themselves in need of evaluation. ## Value you can expect - Dramatically faster prompt improvements - Fewer regressions from language model process changes - A more reliable, production-ready product by default - Confidence - Finding the love of your life - Just kidding, you're on your own for this one pal ## How to implement evals At a high level, the process for implementing evals looks like this: - Create traces for your app - Turn those traces into datasets (example inputs and outputs. You can do this with higher-level functions too, not just LLMs!) - **Observe problems, decide what metrics you should create and evaluate on. E.g. correct formatting, correctness according to experts, etc.** - Implement evaluators. Use programmatic evaluators where possible. Otherwise, LLMs as judges work well when executed carefully. - Use a comparison of human-labeled data to make sure your LLM judges are aligned. - Run your new set of evals during development, after making changes, and in production to ensure continued quality. ### Implementing traces This is covered exhaustively in various places. You can use a tool like Weights & Biases Weave to implement LLM traces in two lines of code and function traces with a simple decorator on any function. ```python # uv pip install weave import weave import your_llm_client weave.init() # every LLM call in the app is traced with this one line @weave.op # this function now tracks inputs, outputs, latency, cost, etc def respond_to_message(message: str): response = your_llm_client.generate(message) return response ``` You can find a complete guide to set this up at [https://weave-docs.wandb.ai/](http://wandb.me/from-sam-evals-post) ### Programmatic evals Any criteria of yours that can be tested programmatically, should be tested programmatically. This has key benefits, including the fact it's nearly free to run and it is reliable in a deterministic way. #### Weave Evals You'll probably want to use a tool for tracking datasets and eval runs. Weave is a tool from Weights & Biases that's a great choice for this. You can find info here, and here is the quick start code. You'll need 2 things to run evals - A dataset, basically a list of dictionaries - One or more evaluation functions, with a dict as input - A weave Model ```python # uv pip install weave from dotenv import load_dotenv import weave load_dotenv() weave.init("evals-example") # dataset dataset = [ {"input": "Apple", "output": "Fruit", "correct_answer": "Fruit"}, {"input": "Tomato", "output": "Vegetable", "correct_answer": "Fruit"}, {"input": "Carrot", "output": "Vegetable", "correct_answer": "Vegetable"} ] # evaluator # this can return a boolean, a number score, or a dict def exact_match(datapoint): return datapoint["a"] == datapoint["b"] # run the evaluator evaluation = weave.Evaluation( name='fruit_eval', dataset=examples, scorers=[exact_match] ) # llm to generate an output anthropic_api_key = os.getenv("ANTHROPIC_API_KEY") class AnthropicChatbot(weave.Model): system_prompt: str = "" model_name: str = "claude-3-haiku-20240307" def __init__(self, **data): super().__init__(**data) formatted_shoes_data = json.dumps(shoes_data, indent=2) self.system_prompt = f"""You are a fruit expert. Given one word, specify whether an input is a 'Fruit' or a 'Vegetable'. Only return that one word, with no other commentary.""" @weave.op() def predict(self, input: str) -> str: if isinstance(input, str): input = [ {"role": "system", "content": self.system_prompt}, {"role": "user", "content": input} ] client = Anthropic() response = client.messages.create( model=self.model_name, system=self.system_prompt, messages=input, ) return response.content[0].text print(asyncio.run(evaluation.evaluate(model))) ``` ##### Strict right-answer comparison / exact string comparison ```python # this can return a boolean, a number score, or a dict def exact_match(datapoint): return datapoint["a"] == datapoint["b"] ``` ##### Keyword checking ```python # check if a keyword is contained in the answer def keyword_match(datapoint): return datapoint["correct_answer"] in datapoint["output"] ``` ##### NLP tone evaluation This one can get complex, so I'll omit a code example for brevity. You can use a library like Python's NLTK. See https://www.datacamp.com/tutorial/text-analytics-beginners-nltk ##### Link checking This can take various forms, but generally it involves making sure any given link is explicitly mentioned in the prompt given. ```python import re # check to see if any link mentioned is present in a list of valid links valid_links = [ "https://store.com/about_us", "https://store.com/product1", "https://store.com/product2" ] @weave.op() def are_links_valid(model_output: str): # Use a regular expression to check for links url_pattern = r'https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)' url_regex = re.compile(url_pattern, re.IGNORECASE) urls = url_regex.findall(model_output) # Check to see if found URLs are valid or not for url in urls: if url not in valid_links: return False return True ``` ##### JSON validation There are many libraries that assist with this. The most popular is the [[Instructor (AI)]] library which uses [[Pydantic]] under the hood. In addition to LLM provider options in tool calling and other modes like JSON mode, there's also [[BAML]], Outlines, Guidance, and many other choices. ```python import json from pydantic import BaseModel # validate it as a Pydantic object here class UserProfile(BaseModel): name: str age: int email: str def validate_user_profile(llm_output: str): try: # Attempt to parse the LLM output as JSON and validate it against the UserProfile model user_data = json.loads(llm_output) UserProfile(**user_data) return True except (json.JSONDecodeError, ValidationError): return False # Example usage: # llm_output = '{"name": "John Doe", "age": 30, "email": "[email protected]"}' # is_valid = validate_user_profile(llm_output) ``` ### LLM evals (LLM as a Judge) These are faster and cheaper than human evaluations, but often require human evaluations and oversight to become accurate. A mature AI app development workflow will include both LLM evaluations and human evaluations working in tandem. LLM as a judge is the most straightforward way to use LLMs to help with evals. You pass your criteria into a language model along with your model's outputs (and sometimes inputs). You'll typically have this output a score or a binary value, possibly with an annotation. This can be used to evaluate qualitative criteria such as: - Tone - Conciseness / verbosity - Helpfulness It can also be used to evaluate more interconnected or complex criteria, such as: - Accuracy / faithfulness to sources - Hallucinations - Similarity to a gold standard answer from a human - Self-awareness of gaps in ability - Vulnerability to red-teaming ##### Faithfulness to a gold standard answer ```python import weave from anthropic import Anthropic @weave.op() async def is_accurate(reference_output: str, model_output: str) -> dict: evaluation_prompt = """You are an expert evaluator of chatbot responses. Your task is to compare an expected output with the actual model output and determine if they convey mostly the same information. Focus on the key points and overall meaning rather than exact wording. Expected Output: {reference_output} Model Output: {model_output} Evaluate if the Model Output conveys essentially the same information as the Expected Output. Provide your verdict as a JSON object with a single key "verdict" and a value of 1 if they match closely, or 0 if they differ significantly. Output in only valid JSON format. """ client = Anthropic() response = client.messages.create( model="claude-3-opus-20240229", max_tokens=300, messages=[ {"role": "user", "content": evaluation_prompt.format( reference_output=reference_output, model_output=model_output )} ], # response_format={"type": "json_object"} ) result = json.loads(response.content[0].text) return { "verdict": result["verdict"] == 1 } ``` ##### Qualitative criteria score This is one of the more straightforward LLM judges. You can get started with a simple prompt: "You are a teacher. Review how concise the student's response is. Give it a score from 1 to 10." This is only a starting point, of course. To really get going, you need to evaluate your judge. Details in the section below. ```python @weave.op() async def is_friendly(model_output: str) -> dict: """ Evaluate the friendliness of the model's output on a scale of 0-3. Args: model_output (str): The string output from the model to be evaluated. Returns: dict: A dictionary containing the friendliness score. """ evaluation_prompt = """You are an expert in evaluating the tone and friendliness of customer service responses. Your task is to rate the following output on a scale of 0-3 for friendliness, where: 0 - Neutral: Professional but not particularly warm 1 - Polite: Shows basic courtesy and respect 2 - Nice: Warm and welcoming tone 3 - Super familiar and kind: Exceptionally friendly and personable Output to evaluate: {model_output} Provide your rating as a JSON object with a single key "friendliness_score" and a value from 0 to 3. Output in only valid JSON format. """ client = Anthropic() response = client.messages.create( model="claude-3-opus-20240229", max_tokens=300, messages=[ {"role": "user", "content": evaluation_prompt.format(model_output=model_output)} ] ) result = json.loads(response.content[0].text) return { "friendliness_score": result["friendliness_score"] } ``` ##### Faithfulness to sources / grounding The implementation of this is very dependent on the means of sourcing used, so I won't share a code snippet. There are some platforms with this built-in, such as Vertex AI. I haven't seen a particular favorite open-source implementation yet. ##### Self-awareness of gaps in ability One of the more frustrating LLM failures is hallucinating capabilities where it has none. This often includes inventing functions, making up facts about the subject matter, or claiming it took actions when it did no such thing. > [!NOTE] Evaluating your LLM judge > If you want more, here's a guide on implementing evals in [[Weave]]: [https://weave-docs.wandb.ai/tutorial-eval](http://wandb.me/tutorial-from-evals-blog) #### [[EvalGen]] (experimental) EvalGen is a meta-approach to LLM evals. It generates a complete set of criteria and LLM judges by reviewing human-annotated datasets, creating criteria based on what the human annotators find important, generating programmatic evals, and aligning its LLM judges to the human annotations. Paper here: https://arxiv.org/abs/2404.12272 This approach is new, but it's a promising attempt at a complete workflow for evals on your own tasks. If you'd like to give it a try,[[EvalForge]] is my friend [[Anish Shah]]'s implementation of it. See his code here: https://github.com/wandb/evalForge ## How to use evals to accelerate development Below is a common flow used to improve language model output in a real application. As your application gets more advanced, some of this can be automated. But in the early stages, this is a key flow for developing reliability. ```mermaid graph TD A[Test your app, or review production traces] B[Identify problems] C[Modify your prompts to fix the problem] D[Run new prompt against your evaluators to ensure no regressions] E[Add your new edge case to your dataset] A --> B B --> C C --> D D --> E E --> A ``` To recap the cycle: 1. Test your app, or review production traces 2. Identify problems 3. Modify your prompts to fix the problem 4. Run new prompt against your evaluators to ensure no regressions 5. Add your new edge case to your dataset This process helps you solve problems as they arise and simultaneously creates a dataset that prevents regressions in your production application. ![[Pasted image 20241025163759.png]] *Example evaluation between two Anthropic models in Weave* ## Challenges in the field, and research directions Despite the importance of evals, it's still not a straightforward process to get them set up. Some tools have a few evaluators built-in, but these often need significant tuning to be relevant to a given app. - Mature evals typically need a bespoke setup. - LLM evaluators are by default unreliable. - Criteria for the evaluation of an app change over time. - The choice of criteria itself is an important consideration. - In production, there are too many traces to run evals on every message. How to choose which ones get evaluated and on what criteria? - When comparing models, different prompt styles can work better for different models. How can you do more apples-to-apples comparisons? DSPy? - It's not straightforward to combine human and LLM evaluators. #### Mature evals need a bespoke setup Built-in evaluators for some platforms work as a starting point, but domain-specific evaluators quickly become necessary. How can we make building and tuning those evaluators easy? #### LLM evaluators default as unreliable You can't one-shot a good LLM judge, generally speaking. You can write a prompt, and the prompt can be a very good starting point, especially if a problem is obvious. However, more qualitative or less straightforward criteria require tuning. How can we make this magically work out of the box? Some attempts: - [[DSPy]] evaluators, and other self-optimizing prompt systems - [[EvalGen]], and other criteria-generating and prompt-optimizing systems #### Criteria for the evaluation of an app change over time As an application develops, its users grow and change, and so do their needs. How can we make sure the criteria stay updated and that we are still accurately evaluating them? #### Choosing the right criteria can be unintuitive What factors are going to be important for good function of the app? How is it going to change in the future? When edge cases are not covered right now, that should be? #### In production, there are too many traces to run evals on every message. How to choose which ones get evaluated and on what criteria? Some platforms have implemented a sampling system which will evaluate some percentage of incoming traces to the user set. This works as a starting point, but would it be better to target evaluations that show some warning signs? For eg., if the user sentiment for something is bad, can we target those messages for evaluation, those generations? Dynamic choice on what to evaluate could make things more efficient and identify problems more quickly. #### When comparing models, different prompt styles can work better for different models. Optimizing latency and cost by trying different models is a common task. However, a prompt that is optimal for one model may not be ideal for another. How can we enable more apples-to-apples comparisons? [[DSPy]]? #### It's not straightforward to combine human and LLM evaluators Any mature evaluation system requires collaboration between humans and LLMs. However, there are many ways to do this, and they require a lot of setup and complexity. How can we make this dramatically easier and more reliable? [[EvalGen]] is an attempted solution, but is currently more experimental. ## Go forth and build evals 🫡 You're now up to speed on the basics. You can check out the further reading section below to find ways to implement the above and explore more experimental techniques. If you have questions or want to chat about it, feel free to reach out to me over [Twitter (@sammakesthings)](https://twitter.com/sammakesthings) ## Further reading #### Guides & courses Getting started guide for the Weave tool: [https://weave-docs.wandb.ai/](http://wandb.me/from-sam-evals-post) [[Anthropic]] course on evals: https://github.com/anthropics/courses/tree/master/prompt_evaluations *It's brief but densely written. High-value course.* Thorough guide on working with LLMs generally, from some excellent names: https://applied-llms.org/ (Eugene Yan, Bryan Bischoff, Charles Frye, Jason Liu, to name a few) #### Interesting papers A Survey of Useful LLM Evaluation https://arxiv.org/html/2406.00936v1#bib.bib149 Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation https://arxiv.org/abs/2402.11443