Who is this for?¶
This article is written for backend engineers and engineering managers who want to understand the engineering that surrounds machine learning.
We will not discuss data science or machine learning itself; the focus is purely on the engineering around it.
Why is this important?¶
- Because machine learning is software 2.0.
Every engineer will eventually need to learn to use ML, just as today, whenever we think of adding search, we reach for Elasticsearch.
Why is this hard?¶
- Why 90 percent of all machine learning models never make it into production
In many ways, machine learning systems are the most complex systems we’ve seen to date, and these complex systems have become mission-critical to many companies. Uber depends on algorithms to automatically predict and price demand for rides. Netflix’s recommendation engine powers its core user experience. And the performance of the AI models in Tesla self-driving cars can save lives. The list goes on.
What is ML? (A quick refresher)¶
Machine Learning (ML) is learning patterns (a "model") from data, and applying those patterns to new data.
To learn more, see the Hands-On Machine Learning book.
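The one-line definition above can be made concrete with a toy sketch: "fit" learns a single parameter from observed data, and "predict" applies it to unseen data. The dataset and the slope-through-the-origin "model" are hypothetical stand-ins for real models.

```python
# Minimal illustration of "learn a pattern from data, apply it to new data".
# The "model" here is just the slope of a least-squares line through the
# origin, fitted on a tiny hypothetical dataset of (x, y) pairs.

def fit(xs, ys):
    """Learn a single parameter (the slope) from training data."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(model, x_new):
    """Apply the learned pattern to unseen data."""
    return model * x_new

# "Training": y is roughly 2 * x in the observed data.
slope = fit([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])

# "Inference" on a new input.
print(round(predict(slope, 10), 1))  # 19.9
```

Real models (scikit-learn, TensorFlow, PyTorch) follow the same fit/predict shape, just with far more parameters.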
What is MLOps?¶
MLOps = Machine Learning Operations = how to operate Machine Learning in production.
From Google Cloud on MLOps:
MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment and infrastructure management.
From Chip Huyen:
Deploying ML systems isn't just about getting ML systems to the end-users. It's about building an infrastructure so the team can be quickly alerted when something goes wrong, figure out what went wrong, test in production, roll-out/rollback updates.
Design For Your Company Culture¶
Before getting started, think through the following aspects, since your answers will shape the solution space:
Standardization or Flexibility?¶
Which of these situations is preferable:
- We should have few chosen technologies, few chosen languages, few chosen tools, and build on those, so that we have shared knowledge and shared tooling investments across the company.
- Any Data Scientist / Machine Learning Engineer should be free to pick the technology, pick the language and pick the tools, so that we have flexibility and no barriers across the company.
e.g. favoring flexibility usually means building on top of Docker containers.
Cloud or Vendor or Open Source?¶
Which of these situations is preferable:
- We are heavily invested into a single cloud, and we are looking to just get started quickly with ML in production.
- We would prefer to offload operations to a vendor, but not be locked into a single cloud, and we want to focus on the product.
- We would prefer to not be locked into a single cloud, not depend on vendors, and we are inclined to invest in the time needed to build knowledge on how to operate the relevant open source software.
Imagine this situation: Employee X builds a few features (F1, F2, F3), ships a model to production, and six months later leaves the company. F2 starts breaking in production (maybe the upstream raw data column has changed from int to string). Downstream users, who own the microservice that asks this model for predictions (via an API), are getting paged. Now who is going to fix F2?
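One mitigation for the scenario above is validating upstream data against an expected schema before features are computed, so a type change fails loudly instead of silently breaking F2. A minimal sketch, with hypothetical column names (real systems would use tools like Great Expectations):

```python
# Minimal schema check: fail fast when an upstream column changes type,
# instead of silently producing broken features downstream.
# Column names and types are hypothetical.

EXPECTED_SCHEMA = {"user_id": int, "order_count": int, "country": str}

def validate_row(row: dict) -> list[str]:
    """Return a list of schema violations for one raw-data row."""
    errors = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    return errors

# The int -> string change described above is caught before anyone is paged.
bad_row = {"user_id": 42, "order_count": "7", "country": "US"}
print(validate_row(bad_row))  # ['order_count: expected int, got str']
```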
What does MLOps contain?¶
From Great Expectations blog:
From The MLOps Stack:
Hierarchy of Needs¶
Where to find raw data?¶
Why is this one of the most important steps?
- Batch data
- Streaming data
How to convert raw data into features?¶
- Feature Engineering
Pipelines (Batch Features)
- Open Source
- 3-min demo of Data pipelines in Spotify Backstage
- KubeFlow Pipelines
- Apache Airflow | Airbnb
- Argo Workflows
- Prefect Core
Bronze, Silver, Gold tables (Medallion architecture)
- Open Source
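The medallion idea can be sketched in a few lines, using plain Python lists and dicts as stand-ins for real tables (in practice these would be Spark/Delta tables): bronze holds raw ingested records, silver holds cleaned and typed records, gold holds business-level aggregates.

```python
# Sketch of the Bronze/Silver/Gold (medallion) pattern with plain Python
# standing in for real tables. Field names and values are hypothetical.

bronze = [  # raw events, as ingested (some malformed)
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "3.0"},
    {"user": "a", "amount": "oops"},  # bad record
]

# Silver: parse/clean, dropping records that fail.
silver = []
for rec in bronze:
    try:
        silver.append({"user": rec["user"], "amount": float(rec["amount"])})
    except ValueError:
        pass  # a real pipeline would route this to a quarantine table

# Gold: business-level aggregate (total spend per user).
gold = {}
for rec in silver:
    gold[rec["user"]] = gold.get(rec["user"], 0.0) + rec["amount"]

print(gold)  # {'a': 10.5, 'b': 3.0}
```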
Where to store & fetch features?¶
- Feature Naming
- Data Version Control
- Feast (open source) + Tecton (commercial)
we are excited to share that Feast is now a component in Kubeflow
Feast requires too much infrastructure: Requiring users provision a large system is a big ask. A minimal Feast deployment requires Kafka, Zookeeper, Postgres, Redis, and multiple Feast services.
Feast lacks composability: Requiring all infrastructural components be present in order to have a functional system removes all modularity.
Ingestion is too complex: Incorporating a Kafka-based stream-first ingestion layer trivializes data consistency across stores, but the complete ingestion flow from source to sink can still mysteriously fail at multiple points.
Our technology choices hinder generalization: Leveraging technologies like BigQuery, Apache Beam on Dataflow, and Apache Kafka has allowed us to move faster in delivering functionality. However, these technologies now impede our ability to generalize to other clouds or deployment environments.
The lessons we’ve learned during the preceding two years have crystallized a vision for what Feast should become: a light-weight modular feature store. One that’s easy to pick up, adds value to teams large and small, and can be progressively applied to production use cases that span multiple teams, projects, and cloud-environments. We aim to reach this by applying the following design principles:
1. Python-first: First-class support for running a minimal version of Feast entirely from a notebook, with all infrastructural dependencies becoming optional enhancements.
- Encourages quick evaluation of the software and ensures Feast is user friendly
- Minimizes the operational burden of running the system in production
- Simplifies testing, developing, and maintaining Feast
2. Production-ready: A collection of battle-tested components built for production.
- Manages high-scale operational workloads for both training and serving
- Integrates with industry-standard monitoring systems for both data and services
- Provides a simplified architecture that facilitates diagnostics and debugging
3. Composability: Modular components with clear extension, integration, and upgrade points that allow for high composability.
- Grants teams the flexibility to adopt specific Feast components
- Incentivizes defining clear component boundaries and data contracts
- Eliminates barriers on teams intending to swap in their existing technologies
4. Cloud-agnostic: Removal of all hard coupling to cloud-specific services, and inclusion of portable technologies like Apache Spark for data processing and Parquet for offline storage.
- Enables deployment into all cloud and on-premise environments
- Introduces a rich set of storage and integration options through Spark I/O
- Improves development velocity by allowing all infrastructure to run locally
Our vision for Feast is not only ambitious, but actionable. Our next release, Feast 0.8, is the product of collaborating with both our open source community and our friends over at Tecton.
- Python-first: We are migrating all core logic to Python, starting with training dataset retrieval and job management, providing a more responsive development experience.
- Modular ingestion: We are shifting to managing batch and streaming ingestion separately, leading to more actionable metrics, logs, and statistics and an easier to understand and operate system.
- Support for AWS: We are replacing GCP-specific technologies like Beam on Dataflow with Spark and adding native support for running Feast on AWS, our first steps toward cloud-agnosticism.
- Data-source integrations: We are introducing support for a host of new data sources (Kinesis, Kafka, S3, GCS, BigQuery) and data formats (Parquet, JSON, Avro), ensuring teams can seamlessly integrate Feast into their existing data-infrastructure.
- https://blog.feast.dev/post/what-is-a-feature-store
- Why Tecton is Backing the Feast Open Source Feature Store
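Conceptually, a feature store keys precomputed feature values by entity so that serving can fetch them at low latency. A toy in-memory sketch of that interface (real systems like Feast back this with an online store such as Redis plus an offline store for training data; all names here are hypothetical):

```python
# Toy in-memory "feature store" showing the shape of the interface:
# batch/stream jobs write (materialize) features, serving reads them.
# Entity ids and feature names are hypothetical.

class InMemoryFeatureStore:
    def __init__(self):
        self._online = {}  # (entity_id, feature_name) -> value

    def materialize(self, entity_id, features: dict):
        """Write precomputed features (normally done by a pipeline job)."""
        for name, value in features.items():
            self._online[(entity_id, name)] = value

    def get_online_features(self, entity_id, feature_names):
        """Read features at serving time, at low latency."""
        return {n: self._online.get((entity_id, n)) for n in feature_names}

store = InMemoryFeatureStore()
store.materialize("user_42", {"order_count_30d": 7, "avg_basket": 23.5})
print(store.get_online_features("user_42", ["order_count_30d", "avg_basket"]))
# {'order_count_30d': 7, 'avg_basket': 23.5}
```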
How to develop models?¶
- Simplify the experience for Data Scientists
- Notebooks + Clusters
- Experiments Management
How to train models?¶
- Notebooks & Clusters
- Training pipeline
- Real-time online learning
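At its simplest, a training pipeline is a repeatable sequence of steps with a quality gate before a model is promoted. A hypothetical sketch, using a trivial mean-predictor as the model (real pipelines run each step as an orchestrated task in Airflow/Kubeflow and version the artifacts):

```python
# Hypothetical minimal training pipeline: each function is one pipeline step.
# The "model" is a trivial baseline (predict the mean label) purely to keep
# the sketch self-contained; data values are made up.

def load_data():
    # Stand-in for reading training data from the warehouse / feature store.
    return [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

def train(rows):
    # Baseline "model": the mean of the labels.
    ys = [y for _, y in rows]
    return sum(ys) / len(ys)

def evaluate(model, rows):
    # Mean absolute error of the baseline predictor.
    return sum(abs(model - y) for _, y in rows) / len(rows)

def run_pipeline():
    rows = load_data()
    model = train(rows)
    mae = evaluate(model, rows)
    assert mae < 3.0, "quality gate before promoting the model"
    return model

model = run_pipeline()
```

The quality-gate assertion stands in for the evaluation step real pipelines run before pushing a new model to serving.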
How to serve models?¶
- Prediction service : API
- Load model
- Take input features in the request
- Gather feature set from feature store + features from request
- Model inference using combined feature set
- Return response
- Monitor performance, scalability, cost like any other API service
- Company experiences
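The steps above can be sketched as a single request handler (all names are hypothetical; in practice this logic sits behind an HTTP API built with something like FastAPI, and the feature store and model are real services/artifacts):

```python
# Sketch of the prediction-service steps listed above. The feature store,
# model, and feature names are all hypothetical stand-ins.

STORED_FEATURES = {"user_42": {"order_count_30d": 7}}  # stand-in feature store

def load_model():
    # Load model (here: a trivial hand-written scoring rule).
    return lambda f: 0.1 * f["order_count_30d"] + 0.5 * f["cart_size"]

MODEL = load_model()

def handle_request(request: dict) -> dict:
    # Take input features from the request.
    entity_id = request["user_id"]
    request_features = {"cart_size": request["cart_size"]}
    # Gather the stored feature set and merge with request features.
    combined = {**STORED_FEATURES.get(entity_id, {}), **request_features}
    # Model inference using the combined feature set.
    score = MODEL(combined)
    # Return response (and log/monitor the score like any API metric).
    return {"user_id": entity_id, "score": round(score, 2)}

print(handle_request({"user_id": "user_42", "cart_size": 3}))
# {'user_id': 'user_42', 'score': 2.2}
```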
How to monitor models?¶
- Log prediction scores within the API service
- A simple solution for monitoring ML systems | Jeremy Jordan
- The Rise of ML Ops: Why Model Performance Monitoring Could Be the Next Billion-Dollar Industry
- 6 Little-Known Challenges After Deploying Machine Learning | Eugene Yan
- Value Propositions of a Great ML Monitoring System
- Why You Should Care About Data and Concept Drift | Evidently AI
- whylogs: Embrace Data Logging Across Your ML Systems
- Data Observability Platform
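Logging prediction scores enables even very simple drift checks: compare a recent window of scores against a reference window. A minimal sketch with a hypothetical threshold (real systems use proper statistical tests such as KS or PSI, via tools like Evidently or whylogs):

```python
# Minimal monitoring sketch: flag drift when the mean prediction score in
# the current window moves too far from a reference window. The threshold
# and the score values are hypothetical.

from statistics import mean, stdev

def score_drifted(reference, current, threshold=3.0):
    """Flag drift when the current mean is more than `threshold`
    reference standard deviations away from the reference mean."""
    ref_mean, ref_std = mean(reference), stdev(reference)
    return abs(mean(current) - ref_mean) > threshold * ref_std

reference = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50]
healthy   = [0.51, 0.49, 0.50, 0.52]
drifted   = [0.80, 0.85, 0.78, 0.82]

print(score_drifted(reference, healthy))  # False
print(score_drifted(reference, drifted))  # True
```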
How to iterate on models?¶
How do I get started with MLOps?¶
- MLOps Maturity | Google Cloud
- Paper : Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)
- A Brief Guide to Running ML Systems in Production - Best Practices for Site Reliability Engineers | O'Reilly
- Book : The Self-Service Data Roadmap
- Continuous Delivery Foundation's MLOps Roadmap 2020
- Applying the MLOps Lifecycle | Seldon
- GitOps for ML using Kubernetes
- Data Engineer Roadmap
Design the phases of building your ML Platform¶
- Designing ML Orchestration Systems for Startups
- DoorDash’s ML Platform – The Beginning
- Michelangelo | Uber
- 5 Lessons Learned Building an Open Source MLOps Platform | Cortex
- AWS Sagemaker
- Google Cloud AI Platform
- AI Infrastructure Alliance
- Machine learning startups generally have no moat or meaningful special sauce
Using Open Source¶
In case you want to avoid vendor lock-in, and prefer open source tools:
- Bet on Kubernetes
- Bet on Apache Kafka or Apache Pulsar
- Kubeflow 1.0 announcement : ML Platform | Kubeflow blog
- Kubeflow: Not Yet Ready for Production? Jan 4, 2021
- Seldon Core : ML Server
- Flyte : ML Platform | Lyft
- Apache Liminal
- TensorFlow Extended (TFX) | Google
- PyTorch Lightning
- Allegro AI
- Cortex prediction server
- Amundsen : data discovery and metadata | Lyft
- Feast : feature store | GoJek & Google Cloud
- Linux Foundation's AI Foundation projects
What to Read Next?¶
- Book : Building Machine Learning Powered Applications : teaches product life-cycle for ML 💯
- Book : Introducing MLOps
- ML Systems Design Interview Guide | Patrick Halina
- Book : Machine Learning Design Patterns
- Machine Learning Systems Design | Chip Huyen
- Book : Building Machine Learning Pipelines
- Book : Kubeflow for Machine Learning
- Book : Self-Service Data Roadmap
- Book : Designing Data-Intensive Applications
- Book : The DevOps Handbook
- Book : Observability Engineering
- Newsletter : ML in Production
- Videos : Stanford MLSys Seminar Series
- Course : Full Stack Deep Learning
- Awesome MLOps
- LF AI & Data Foundation Interactive Landscape
- Awesome Production Machine Learning
- Podcast : twiml
- 100 open source Big Data and ML architecture papers for data professionals
Obligatory hiring pitch: If you have read all the way till here, you must be interested in working on MLOps, so come work with me in the ML Platform team at DoorDash! 😄