It started as a course on fine-tuning LLMs by Hamel Husain & Dan Becker. It then blew up into a full conference on mastering LLMs, and in the process I couldn’t keep up with the talks as they happened. I finally caught up on the recordings; several of the talks were inspiring and insightful. These were my takeaways:
From Prompt Engineering Workshop w/ John Berryman:
From Dan & Hamel in Fine-Tuning Workshop 3:
Your AI Product Needs Evals – Hamel’s Blog.
Unit tests to catch basic issues, e.g. does the output still behave when there are 0, 1, or 2 items to talk about (a minimal test sketch follows this list).
LLM-as-a-judge needs to be aligned with human judgments, i.e. the judge itself needs a mini-evaluation.
LLM-as-a-judge can be misleading due to position bias: swap A & B and check whether the verdict flips (see the second sketch after this list).
Human evals can also be misleading because raters’ expectations rise over time, so use A/B testing.
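A minimal sketch of the kind of unit test meant here; `generate_summary` and the `myapp` module are hypothetical stand-ins for your own application code, not anything from the talk:

```python
# Hypothetical unit tests for an LLM feature that writes about a list of items.
# `myapp.generate_summary` is an assumed application function, not from the talk.
import pytest

from myapp import generate_summary


@pytest.mark.parametrize("items", [[], ["apples"], ["apples", "oranges"]])
def test_handles_zero_one_and_two_items(items):
    output = generate_summary(items)
    # Cheap assertions that catch obvious failures before any LLM judging.
    assert isinstance(output, str)
    assert output.strip() != ""
    for item in items:
        assert item.lower() in output.lower()
```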
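And a sketch of the two LLM-as-a-judge sanity checks above, assuming a wrapper `ask_judge(response_a, response_b)` that returns "A" or "B"; the interface and function names are my own illustration, not the speakers’ implementation:

```python
# Sketch of two LLM-as-a-judge sanity checks: agreement with human labels,
# and position-swap consistency. `ask_judge(a, b) -> "A" | "B"` is assumed.

def agreement_rate(pairs, human_labels, ask_judge):
    """Fraction of (a, b) pairs where the judge matches the human label ('A' or 'B')."""
    hits = sum(ask_judge(a, b) == label for (a, b), label in zip(pairs, human_labels))
    return hits / len(pairs)


def swap_consistency(pairs, ask_judge):
    """Fraction of pairs where swapping A and B flips the verdict as it should."""
    consistent = 0
    for a, b in pairs:
        first = ask_judge(a, b)
        second = ask_judge(b, a)
        # A consistent judge that picked "A" first should pick "B" after the swap.
        consistent += (first == "A") == (second == "B")
    return consistent / len(pairs)
```

If the swap-consistency rate is well below 1.0, the judge is reacting to position rather than content, which is exactly the misleading case mentioned above.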
From Spellgrounds for Prodigious Prestidigitation by Bryan Bischof:
Evals are for:
Things to avoid: