Tags
tech/ml/generative-ai
Mindset Change
- Shift from code mindset to data mindset.
- Just like TDD, do eval-driven development.
- Look at data daily and refine the system to improve coverage gradually.
- It’s a game of experimentation and iteration, not a one-time build.
Flow
Process
1. Build an eval dataset → expected inputs & ideal outputs sourced from your actual documents (i.e., do it manually first)
2. Build RAG (retrieval-augmented generation) to generate context
3. Evaluate models for your use case
4. Run different combinations of the system (RAG + model + prompt engineering) and rank them by comparing their results against the eval dataset
5. Deploy the best combination to production with guardrails
6. Analyze the logs to find patterns, then (1) improve the eval dataset, (2) improve the quality of the context, or (3) improve the quality of the model (e.g., fine-tuning)
7. Go back to Step 5 and iterate
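The "run combinations and rank them" step can be sketched as a tiny harness. Everything here is hypothetical: the candidate systems are stubs, and the scorer is a toy word-overlap metric — a real eval would use an LLM judge or embedding similarity instead of string matching.

```python
# Toy eval harness: score candidate systems against an eval dataset and rank them.
# Systems and the metric are stand-ins, not a real implementation.

def score(answer: str, ideal: str) -> float:
    """Toy metric: fraction of ideal-answer words that appear in the answer."""
    ideal_words = set(ideal.lower().split())
    answer_words = set(answer.lower().split())
    return len(ideal_words & answer_words) / len(ideal_words)

# Eval dataset: expected inputs & ideal outputs sourced from your actual documents.
eval_dataset = [
    {"input": "What is our refund window?", "ideal": "30 days from purchase"},
    {"input": "Who approves expenses?",     "ideal": "your direct manager"},
]

# Candidate "systems" = RAG + model + prompt combinations (stubbed here).
def system_a(question): return "You can get a refund within 30 days from purchase"
def system_b(question): return "Contact support"

candidates = {"rag_v1+big_model+prompt_v2": system_a,
              "no_rag+small_model+prompt_v1": system_b}

def rank(candidates, dataset):
    """Average each system's score over the dataset, best first."""
    results = {
        name: sum(score(fn(row["input"]), row["ideal"]) for row in dataset) / len(dataset)
        for name, fn in candidates.items()
    }
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

for name, avg in rank(candidates, eval_dataset):
    print(f"{name}: {avg:.2f}")
```

The same loop works unchanged when you swap in a new model or prompt — only the `candidates` dict grows, which is what makes iteration cheap.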
Quality of context (RAG) is as important as the quality of the model.
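A minimal sketch of what "context quality" means in practice — retrieve the most relevant chunks and prepend them to the prompt. The documents and the word-overlap retrieval below are toys; real systems use embeddings and a vector store.

```python
# Toy RAG retrieval: pick the top-k chunks by word overlap with the query,
# then build a context-grounded prompt. Illustrative only.

DOCUMENTS = [
    "A refund is accepted within 30 days of purchase.",
    "Expense reports must be approved by your direct manager.",
    "Our office is closed on public holidays.",
]

def retrieve(query: str, docs, k: int = 2):
    """Rank docs by shared words with the query (stand-in for embedding search)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, DOCUMENTS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is the refund window?"))
```

If retrieval pulls the wrong chunks, even the best model answers from the wrong facts — which is why context quality matters as much as model quality.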
Q&A
- What are evals?
- Evals are analogous to tests in test-driven development (TDD) in software engineering
- Why are evals important?
- Because there is no other way to systematically score the accuracy of the system for your use case.
- Unlike deterministic software engineering, trying a few random prompts will not ensure that the system works well across the variety of inputs it sees in production.
- Why should we build evals first?
- To validate that your idea is feasible and practical
- To set a benchmark for the accuracy you need
- To compare the system (RAG + model + prompt) with your ideal benchmark, which helps you understand where to improve the system
- To ensure that you’re not making the system worse when you’re making changes
- To increase your speed of experimenting with new combinations of systems and iterating towards better accuracy, speed, and cost.
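The "don't make the system worse" point can be enforced mechanically: gate every change on its eval score versus the current baseline. A minimal sketch with made-up numbers:

```python
# Toy regression gate: a change ships only if its eval score does not drop
# measurably below the production baseline. Numbers are illustrative.

BASELINE_SCORE = 0.82  # eval score of the system currently in production

def should_ship(new_score: float, tolerance: float = 0.01) -> bool:
    """Block changes that make the system measurably worse on the eval set."""
    return new_score >= BASELINE_SCORE - tolerance

print(should_ship(0.85))  # improvement: ship it
print(should_ship(0.70))  # regression: keep iterating
```

Wired into CI, this is the GenAI equivalent of a failing test blocking a merge.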
- When to fine-tune
- Only after you have an eval dataset + exhausted prompt engineering (see Hamel’s note)
- Fine Tuning Is For Form, Not Facts | Anyscale
- Why fine-tune?
- To reduce cost by using a smaller model
- To increase accuracy of a smaller model
- To reduce latency, because fine-tuned models need shorter prompts at inference time
- To reduce cost, also because of shorter prompts at inference time
- When to use a GenAI API Gateway?
- Enterprise: If you are in a large enterprise company context, with hundreds of use cases across different model vendor APIs & different open source models, and you want to enforce best practices such as cost attribution, logging, guardrails, etc.
- Solo Dev Mode: If you don’t want to change your code every time you want to try a new model. Bonus: Automatic logging of all requests & responses, to inspect later.
- See references below
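The "solo dev mode" idea can be sketched as one call site with swappable models plus automatic request logging. The providers below are stubs and the function names are hypothetical; a real gateway (e.g. LiteLLM, Portkey) makes the actual API calls and adds cost attribution and guardrails.

```python
# Sketch of a minimal gateway: one entry point, swappable models, logged calls.
# Provider functions are stubs standing in for real vendor API calls.

import time

def _call_vendor_a_stub(prompt): return f"[vendor-a] {prompt}"
def _call_vendor_b_stub(prompt): return f"[vendor-b] {prompt}"

PROVIDERS = {"big-model": _call_vendor_a_stub, "small-model": _call_vendor_b_stub}

REQUEST_LOG = []  # automatic logging of all requests & responses

def complete(model: str, prompt: str) -> str:
    """Single entry point: trying a new model means changing only the string."""
    if model not in PROVIDERS:
        raise ValueError(f"unknown model: {model}")
    response = PROVIDERS[model](prompt)
    REQUEST_LOG.append({"ts": time.time(), "model": model,
                        "prompt": prompt, "response": response})
    return response

print(complete("big-model", "Summarize our refund policy."))
print(complete("small-model", "Summarize our refund policy."))
```

The enterprise version adds auth, per-team cost attribution, and guardrails at this same choke point, which is why it is a gateway and not a client library.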
Links
- Introduction to LLMs
- https://www.futureofai.mit.edu/
- https://github.com/microsoft/generative-ai-for-beginners
- https://github.com/mlabonne/llm-course
- Overview of LLMOps
- The architecture of today's LLM applications - The GitHub Blog
- Emerging Architectures for LLM Applications | Andreessen Horowitz
- Foundation Model Ops: Powering the Next Wave of Generative AI Apps - Foundation Capital
- Navigating the LLMops landscape: What you need to know | Insight Partners
- Patterns for Building LLM-based Systems & Products — Eugene Yan
- https://book.premai.io/state-of-open-source-ai/
- https://gist.github.com/veekaybee/be375ab33085102f9027853128dc5f0e
- https://martinfowler.com/articles/engineering-practices-llm.html
- Open-Source AI Cookbook - Hugging Face
- Caution
- Still early technology
- https://blog.matt-rickard.com/p/autonomous-llm-agents-are-10-years
- https://blog.matt-rickard.com/p/where-ai-fits-in-engineering-organizations
- GenAI as a compound system, not a single model
- https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
- Generative AI Lifecycle Patterns — Ali Arsanjani
- AI is a triumvirate of data, algorithms, and hardware
- Prompt Engineering
- https://aman.ai/primers/ai/prompt-engineering/
- https://hamel.dev/blog/posts/prompt/
- https://github.com/Eladlev/AutoPrompt
- https://twitter.com/LangChainAI/status/1762519058912838048?t=kVfGu8SGwfLdKl6gtG-PfA&s=19
- Evals
- https://hamel.dev/blog/posts/evals/
- https://slides.com/andreilopatenko/deck-6eb5fe (via)
- https://github.com/openai/evals
- Ragas: Building the Opensource standard for evaluating LLM application
- https://github.com/promptfoo/promptfoo
- https://github.com/truera/trulens
- https://docs.arize.com/phoenix/evaluation/llm-evals
- https://www.vellum.ai/blog/introducing-vellum-evaluations
- https://wandb.ai/wandbot/wandbot_public/reports/Evaluation-Driven-Development-Improving-WandBot-our-LLM-Powered-Documentation-App--Vmlldzo2NTY1MDI0
- https://docs.rungalileo.io/galileo/galileo-gen-ai-studio/llm-studio
- https://twitter.com/eugeneyan/status/1764066697454182592
- https://uptrain.ai
- Judge LLM
- RAG
- https://aws.amazon.com/what-is/retrieval-augmented-generation/
- LangChain: https://python.langchain.com/docs/get_started/introduction
- LlamaIndex: https://docs.llamaindex.ai/en/stable/
- https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/
- https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6
- Dataset collection for Fine-tuning
- Fine-tuning
- https://github.com/OpenAccess-AI-Collective/axolotl
- https://github.com/unslothai/unsloth
- https://github.com/huggingface/autotrain-advanced
- https://github.com/BatsResearch/bonito
- https://fireworks.ai/blog/fine-tune-launch
- https://smol.ai/
- Guardrails
- LLM API Gateway
- The AI Gateway Pattern for Enterprises - by John Hwang
- Create a Generative AI Gateway to allow secure and compliant consumption of foundation models | AWS Machine Learning Blog
- Designing and implementing a gateway solution with Azure OpenAI resources | Microsoft Learn
- https://huyenchip.com/2024/02/28/predictive-human-preference.html
- https://portkey.ai/
- https://litellm.ai/
- https://withmartian.com/
- Security
- https://www.ycombinator.com/companies/promptarmor
- https://safety.google/cybersecurity-advancements/saif/
- https://github.com/greshake/llm-security
- Model Serving
While GenAI is fun, I think its economic value is grossly overestimated, because it’s unreliable, risky and expensive to make and serve. It’s fine for creative tasks, but not (yet) autonomous agents — via
Models to start with
- Paid Foundation Models
- Open-weights Foundation Models
- https://www.baseten.co/blog/the-best-open-source-large-language-model/
- Meta’s Llama 2
- Mistral model (via)
- Multimodal - text, image, audio, video
Advanced
- Patterns
- Foundation Model Training
- https://fmcheatsheet.org/
- MosaicML Composer (via)