Tags
tech/ml/generative-ai
Mindset Change
- Shift from code mindset to data mindset.
- Just like TDD, do eval-driven development.
- Look at data daily and refine the system to improve coverage gradually.
- It’s a game of experimentation and iteration, not a one-time build.
Flow
Process
1. Build an eval dataset → expected inputs & ideal outputs sourced from your actual documents (i.e., do it manually first)
2. Build RAG (retrieval-augmented generation) to generate context
3. Evaluate models for your use case
4. Run different combinations of the system (RAG + model + prompt engineering) and rank them by comparing their results against the eval dataset
5. Deploy the best combination to production with guardrails
6. Analyze the logs to find patterns, then (1) improve the eval dataset, (2) improve the quality of the context, or (3) improve the quality of the model (e.g., fine-tuning)
7. Go back to Step 5 and iterate
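The "run combinations and rank them" step can be sketched as a tiny harness. Everything here is hypothetical: the candidate systems are stubs, and the scorer is a toy word-overlap metric — a real eval would use an LLM judge or embedding similarity instead of string matching.

```python
# Toy eval harness: score candidate systems against an eval dataset and rank them.
# Systems and the metric are stand-ins, not a real implementation.

def score(answer: str, ideal: str) -> float:
    """Toy metric: fraction of ideal-answer words that appear in the answer."""
    ideal_words = set(ideal.lower().split())
    answer_words = set(answer.lower().split())
    return len(ideal_words & answer_words) / len(ideal_words)

# Eval dataset: expected inputs & ideal outputs sourced from your actual documents.
eval_dataset = [
    {"input": "What is our refund window?", "ideal": "30 days from purchase"},
    {"input": "Who approves expenses?",     "ideal": "your direct manager"},
]

# Candidate "systems" = RAG + model + prompt combinations (stubbed here).
def system_a(question): return "You can get a refund within 30 days from purchase"
def system_b(question): return "Contact support"

candidates = {"rag_v1+big_model+prompt_v2": system_a,
              "no_rag+small_model+prompt_v1": system_b}

def rank(candidates, dataset):
    """Average each system's score over the dataset, best first."""
    results = {
        name: sum(score(fn(row["input"]), row["ideal"]) for row in dataset) / len(dataset)
        for name, fn in candidates.items()
    }
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

for name, avg in rank(candidates, eval_dataset):
    print(f"{name}: {avg:.2f}")
```

The same loop works unchanged when you swap in a new model or prompt — only the `candidates` dict grows, which is what makes iteration cheap.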
Quality of context (RAG) is as important as the quality of the model.
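A minimal sketch of what "context quality" means in practice — retrieve the most relevant chunks and prepend them to the prompt. The documents and the word-overlap retrieval below are toys; real systems use embeddings and a vector store.

```python
# Toy RAG retrieval: pick the top-k chunks by word overlap with the query,
# then build a context-grounded prompt. Illustrative only.

DOCUMENTS = [
    "A refund is accepted within 30 days of purchase.",
    "Expense reports must be approved by your direct manager.",
    "Our office is closed on public holidays.",
]

def retrieve(query: str, docs, k: int = 2):
    """Rank docs by shared words with the query (stand-in for embedding search)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, DOCUMENTS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is the refund window?"))
```

If retrieval pulls the wrong chunks, even the best model answers from the wrong facts — which is why context quality matters as much as model quality.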
Q&A
- What are evals?
- Evals are analogous to tests in test-driven development (TDD) in software engineering
- Why are evals important?
- Because there is no other way to systematically score the accuracy of the system for your use case.
- Unlike deterministic software engineering, trying a few random prompts will not ensure that the system works well across the variety of inputs it sees in production.
- Why should we build evals first?
- To validate that your idea is feasible and practical
- To set a benchmark for the accuracy you need
- To compare the system (RAG + model + prompt) with your ideal benchmark, which helps you understand where to improve the system
- To ensure that you’re not making the system worse when you’re making changes
- To increase your speed of experimenting with new combinations of systems and iterating towards better accuracy, speed, and cost.
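The "don't make the system worse" point can be enforced mechanically: gate every change on its eval score versus the current baseline. A minimal sketch with made-up numbers:

```python
# Toy regression gate: a change ships only if its eval score does not drop
# measurably below the production baseline. Numbers are illustrative.

BASELINE_SCORE = 0.82  # eval score of the system currently in production

def should_ship(new_score: float, tolerance: float = 0.01) -> bool:
    """Block changes that make the system measurably worse on the eval set."""
    return new_score >= BASELINE_SCORE - tolerance

print(should_ship(0.85))  # improvement: ship it
print(should_ship(0.70))  # regression: keep iterating
```

Wired into CI, this is the GenAI equivalent of a failing test blocking a merge.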
- When to fine-tune
- Only after you have an eval dataset + exhausted prompt engineering (see Hamel’s note)
- Fine Tuning Is For Form, Not Facts | Anyscale
- Why fine-tune?
- To reduce cost by using a smaller model
- To increase accuracy of a smaller model
- To reduce latency, because fine-tuned models need shorter prompts at inference time
- To reduce cost, also because of shorter prompts at inference time
- When to use a GenAI API Gateway?
- Enterprise: If you are in a large enterprise company context, with hundreds of use cases across different model vendor APIs & different open source models, and you want to enforce best practices such as cost attribution, logging, guardrails, etc.
- Solo Dev Mode: If you don’t want to change your code every time you want to try a new model. Bonus: Automatic logging of all requests & responses, to inspect later.
- See references below
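The "solo dev mode" idea can be sketched as one call site with swappable models plus automatic request logging. The providers below are stubs and the function names are hypothetical; a real gateway (e.g. LiteLLM, Portkey) makes the actual API calls and adds cost attribution and guardrails.

```python
# Sketch of a minimal gateway: one entry point, swappable models, logged calls.
# Provider functions are stubs standing in for real vendor API calls.

import time

def _call_vendor_a_stub(prompt): return f"[vendor-a] {prompt}"
def _call_vendor_b_stub(prompt): return f"[vendor-b] {prompt}"

PROVIDERS = {"big-model": _call_vendor_a_stub, "small-model": _call_vendor_b_stub}

REQUEST_LOG = []  # automatic logging of all requests & responses

def complete(model: str, prompt: str) -> str:
    """Single entry point: trying a new model means changing only the string."""
    if model not in PROVIDERS:
        raise ValueError(f"unknown model: {model}")
    response = PROVIDERS[model](prompt)
    REQUEST_LOG.append({"ts": time.time(), "model": model,
                        "prompt": prompt, "response": response})
    return response

print(complete("big-model", "Summarize our refund policy."))
print(complete("small-model", "Summarize our refund policy."))
```

The enterprise version adds auth, per-team cost attribution, and guardrails at this same choke point, which is why it is a gateway and not a client library.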
Links
- Introduction to LLMs
- https://www.futureofai.mit.edu/
- https://github.com/microsoft/generative-ai-for-beginners
- https://github.com/mlabonne/llm-course
- Overview of LLMOps
- The architecture of today's LLM applications - The GitHub Blog
- Emerging Architectures for LLM Applications | Andreessen Horowitz
- Foundation Model Ops: Powering the Next Wave of Generative AI Apps - Foundation Capital
- Navigating the LLMops landscape: What you need to know | Insight Partners
- Patterns for Building LLM-based Systems & Products — Eugene Yan
- https://book.premai.io/state-of-open-source-ai/
- https://gist.github.com/veekaybee/be375ab33085102f9027853128dc5f0e
- https://martinfowler.com/articles/engineering-practices-llm.html
- Open-Source AI Cookbook - Hugging Face
- Caution
- Still early technology
- https://blog.matt-rickard.com/p/autonomous-llm-agents-are-10-years
- https://blog.matt-rickard.com/p/where-ai-fits-in-engineering-organizations
- GenAI as a compound system, not a single model
- https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
- Generative AI Lifecycle Patterns — Ali Arsanjani
- AI is a triumvirate of data, algorithms, and hardware
- Prompt Engineering
- https://aman.ai/primers/ai/prompt-engineering/
- https://hamel.dev/blog/posts/prompt/
- https://github.com/Eladlev/AutoPrompt
- https://twitter.com/LangChainAI/status/1762519058912838048?t=kVfGu8SGwfLdKl6gtG-PfA&s=19
- Evals
- https://hamel.dev/blog/posts/evals/
- https://slides.com/andreilopatenko/deck-6eb5fe (via)
- https://github.com/openai/evals
- Ragas: Building the Opensource standard for evaluating LLM application
- https://github.com/promptfoo/promptfoo
- https://github.com/truera/trulens
- https://docs.arize.com/phoenix/evaluation/llm-evals
- https://www.vellum.ai/blog/introducing-vellum-evaluations
- https://wandb.ai/wandbot/wandbot_public/reports/Evaluation-Driven-Development-Improving-WandBot-our-LLM-Powered-Documentation-App--Vmlldzo2NTY1MDI0
- https://docs.rungalileo.io/galileo/galileo-gen-ai-studio/llm-studio
- https://twitter.com/eugeneyan/status/1764066697454182592
- https://uptrain.ai
- Judge LLM
- RAG
- https://aws.amazon.com/what-is/retrieval-augmented-generation/
- LangChain: https://python.langchain.com/docs/get_started/introduction
- LlamaIndex: https://docs.llamaindex.ai/en/stable/
- https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/
- https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6
- Dataset collection for Fine-tuning
- Fine-tuning
- https://github.com/OpenAccess-AI-Collective/axolotl
- https://github.com/unslothai/unsloth
- https://github.com/huggingface/autotrain-advanced
- https://github.com/BatsResearch/bonito
- https://fireworks.ai/blog/fine-tune-launch
- https://smol.ai/
- Guardrails
- LLM API Gateway
- The AI Gateway Pattern for Enterprises - by John Hwang
- Create a Generative AI Gateway to allow secure and compliant consumption of foundation models | AWS Machine Learning Blog
- Designing and implementing a gateway solution with Azure OpenAI resources | Microsoft Learn
- https://huyenchip.com/2024/02/28/predictive-human-preference.html
- https://portkey.ai/
- https://litellm.ai/
- https://withmartian.com/
- Security
- https://www.ycombinator.com/companies/promptarmor
- https://safety.google/cybersecurity-advancements/saif/
- https://github.com/greshake/llm-security
- Model Serving
While GenAI is fun, I think its economic value is grossly overestimated, because it’s unreliable, risky and expensive to make and serve. It’s fine for creative tasks, but not (yet) autonomous agents — via
Models to start with
- Paid Foundation Models
- Open-weights Foundation Models
- https://www.baseten.co/blog/the-best-open-source-large-language-model/
- Meta’s Llama 2
- Mistral model (via)
- Multimodal - text, image, audio, video
Advanced
- Patterns
- Foundation Model Training
- https://fmcheatsheet.org/
- MosaicML Composer (via)