

Who is this for?

This article is written for backend engineers and engineering managers looking to understand the world of engineering around machine learning.

We will not discuss data science or machine learning itself. The focus is on the engineering around it.

Why is this important?

Why is this hard?



What is ML? (A quick refresher)

Machine Learning (ML) means learning patterns (a "model") from data, then applying that model to new data.

In concept:


(figure from the book Deep Learning with Python)

In practice:



To learn more, see the Hands-On Machine Learning book.
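The learn-then-apply loop above can be sketched in a few lines of plain Python. This is a hedged illustration of the concept only (a one-variable least-squares "model"), not code from any of the books referenced:

```python
# Minimal sketch of the ML loop: learn a pattern from data ("training"),
# then apply it to new data ("inference"). The "model" here is just a
# single slope w minimizing sum((w*x - y)^2).

def fit(xs, ys):
    """Training step: learn the least-squares slope from example pairs."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

def predict(w, xs):
    """Inference step: apply the learned model to new, unseen inputs."""
    return [w * x for x in xs]

# Training data roughly follows y = 2x.
w = fit([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.0])
print(predict(w, [5, 10]))  # predictions for inputs the model never saw
```

Real systems replace `fit` with a library such as scikit-learn or TensorFlow, but the shape of the loop (fit on historical data, predict on new data) stays the same.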

What is MLOps?

MLOps = Machine Learning Operations = how to operate Machine Learning in production.

From Google Cloud on MLOps:

MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment and infrastructure management.

From "Hidden Technical Debt in Machine Learning Systems" paper (Google):


From Chip Huyen:

Deploying ML systems isn't just about getting ML systems to the end-users. It's about building an infrastructure so the team can be quickly alerted when something goes wrong, figure out what went wrong, test in production, roll-out/rollback updates.
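The "quickly alerted when something goes wrong" part of that quote can be made concrete with a toy drift check: compare the live prediction distribution against a training-time baseline and alert when they diverge. The function name and threshold below are illustrative assumptions, not taken from any specific monitoring tool:

```python
# Hedged sketch of prediction-drift monitoring: alert when the mean of
# live predictions moves too far from the baseline observed at training
# time. The 20% relative tolerance is an arbitrary illustrative choice.

def drift_alert(live_preds, baseline_mean, tolerance=0.2):
    """Return True if live predictions drifted beyond the relative tolerance."""
    live_mean = sum(live_preds) / len(live_preds)
    return abs(live_mean - baseline_mean) > tolerance * abs(baseline_mean)

print(drift_alert([0.52, 0.48, 0.50], baseline_mean=0.5))  # stable traffic
print(drift_alert([0.91, 0.88, 0.95], baseline_mean=0.5))  # drifted traffic
```

Production systems would use richer statistics (histograms, KL divergence) and wire the alert into paging, but the idea is the same: monitor model outputs, not just service health.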

Design For Your Company Culture

Before getting started, consider the following questions; your answers will shape the solution space:

Standardization or Flexibility?

Which of these situations is preferable:

  • We should have few chosen technologies, few chosen languages, few chosen tools, and build on those, so that we have shared knowledge and shared tooling investments across the company.
  • Any Data Scientist / Machine Learning Engineer should be free to pick the technology, pick the language and pick the tools, so that we have flexibility and no barriers across the company.

e.g. Flexibility usually means standardizing only on a low-level abstraction such as Docker containers, and letting teams choose what runs inside them.

Cloud or Vendor or Open Source?

Which of these situations is preferable:

  • We are heavily invested into a single cloud, and we are looking to just get started quickly with ML in production.
  • We would prefer to offload operations to a vendor, but not be locked into a single cloud, and we want to focus on the product.
  • We would prefer to not be locked into a single cloud, not depend on vendors, and we are inclined to invest in the time needed to build knowledge on how to operate the relevant open source software.

e.g. How DoorDash is Scaling its Data Platform


Imagine this situation: Employee X builds a few features (F1, F2, F3), ships a model to production, and six months later leaves the company. F2 starts breaking in production (maybe the upstream raw data column has changed from int to string). Downstream users, who own the microservice that asks this model for predictions (via an API), are getting paged. Now who is going to fix F2?
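A schema check at the feature-pipeline boundary would surface the int-to-string drift in that scenario at ingestion time, instead of letting downstream consumers discover it via pages. The column names and types below are hypothetical, chosen only to mirror the F1/F2/F3 example:

```python
# Hypothetical sketch: validate raw input rows against the schema the
# feature pipeline expects, so an upstream type change (e.g. F2 going
# from int to string) fails loudly at ingestion rather than in serving.

EXPECTED_SCHEMA = {"f1": float, "f2": int, "f3": float}  # illustrative names

def validate_row(row: dict) -> list:
    """Return a list of human-readable schema violations for one raw row."""
    errors = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        if name not in row:
            errors.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    return errors

# Upstream silently changed f2 from int to string:
print(validate_row({"f1": 0.3, "f2": "42", "f3": 1.5}))
# -> ['f2: expected int, got str']
```

Tools like Great Expectations (linked below) generalize this idea into declarative expectation suites run as part of the data pipeline.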

e.g. Data Quality at Airbnb

What does MLOps contain?

From Great Expectations blog:


From The MLOps Stack:


Hierarchy of Needs

Where to find raw data?

How to convert raw data into features?

Where to store & fetch features?

  • Feature Naming
  • Data Version Control
  • Feature Store

    • What is a Feature Store? | Tecton




    • Podcast : Feature Stores for MLOps with Mike del Balso | TWIML

    • Feast (open source) + Tecton (commercial)
      • A state of Feast


        we are excited to share that Feast is now a component in Kubeflow

        Lessons Learned

        Feast requires too much infrastructure: Requiring users provision a large system is a big ask. A minimal Feast deployment requires Kafka, Zookeeper, Postgres, Redis, and multiple Feast services.

        Feast lacks composability: Requiring all infrastructural components be present in order to have a functional system removes all modularity.

        Ingestion is too complex: Incorporating a Kafka-based stream-first ingestion layer trivializes data consistency across stores, but the complete ingestion flow from source to sink can still mysteriously fail at multiple points.

        Our technology choices hinder generalization: Leveraging technologies like BigQuery, Apache Beam on Dataflow, and Apache Kafka has allowed us to move faster in delivering functionality. However, these technologies now impede our ability to generalize to other clouds or deployment environments.


        The lessons we’ve learned during the preceding two years have crystallized a vision for what Feast should become: a light-weight modular feature store. One that’s easy to pick up, adds value to teams large and small, and can be progressively applied to production use cases that span multiple teams, projects, and cloud-environments. We aim to reach this by applying the following design principles:

        1. Python-first: First-class support for running a minimal version of Feast entirely from a notebook, with all infrastructural dependencies becoming optional enhancements.

        • Encourages quick evaluation of the software and ensures Feast is user friendly
        • Minimizes the operational burden of running the system in production
        • Simplifies testing, developing, and maintaining Feast

        2. Production-ready: A collection of battle-tested components built for production.

        • Manages high-scale operational workloads for both training and serving
        • Integrates with industry-standard monitoring systems for both data and services
        • Provides a simplified architecture that facilitates diagnostics and debugging

        3. Composability: Modular components with clear extension, integration, and upgrade points that allow for high composability.

        • Grants teams the flexibility to adopt specific Feast components
        • Incentivizes defining clear component boundaries and data contracts
        • Eliminates barriers on teams intending to swap in their existing technologies

        4. Cloud-agnostic: Removal of all hard coupling to cloud-specific services, and inclusion of portable technologies like Apache Spark for data processing and Parquet for offline storage.

        • Enables deployment into all cloud and on-premise environments
        • Introduces a rich set of storage and integration options through Spark I/O
        • Improves development velocity by allowing all infrastructure to run locally

        Next Steps

        Our vision for Feast is not only ambitious, but actionable. Our next release, Feast 0.8, is the product of collaborating with both our open source community and our friends over at Tecton.

        1. Python-first: We are migrating all core logic to Python, starting with training dataset retrieval and job management, providing a more responsive development experience.
        2. Modular ingestion: We are shifting to managing batch and streaming ingestion separately, leading to more actionable metrics, logs, and statistics and an easier to understand and operate system.
        3. Support for AWS: We are replacing GCP-specific technologies like Beam on Dataflow with Spark and adding native support for running Feast on AWS, our first steps toward cloud-agnosticism.
        4. Data-source integrations: We are introducing support for a host of new data sources (Kinesis, Kafka, S3, GCS, BigQuery) and data formats (Parquet, JSON, Avro), ensuring teams can seamlessly integrate Feast into their existing data-infrastructure.

      • Why Tecton is Backing the Feast Open Source Feature Store
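To make the "store & fetch features" idea concrete, here is a toy in-memory sketch of the core feature-store interface. All class and method names are illustrative assumptions, not Feast's actual API; real stores back this with an online store (e.g. Redis) plus an offline warehouse, and add point-in-time correctness for training data:

```python
from collections import defaultdict

# Toy in-memory feature store: ingest the latest feature values per
# entity, then fetch them at serving time in the order the model expects.
# Names here are illustrative, not the API of Feast or Tecton.

class FeatureStore:
    def __init__(self):
        # feature values keyed by entity_id, then feature name
        self._online = defaultdict(dict)

    def ingest(self, entity_id, features):
        """Write the latest feature values for an entity (e.g. a user)."""
        self._online[entity_id].update(features)

    def get_online_features(self, entity_id, feature_names):
        """Fetch features at serving time; missing features come back as None."""
        row = self._online[entity_id]
        return [row.get(name) for name in feature_names]

store = FeatureStore()
store.ingest("user_42", {"order_count_7d": 3, "avg_basket_usd": 27.5})
print(store.get_online_features("user_42", ["order_count_7d", "avg_basket_usd"]))
# -> [3, 27.5]
```

The value of the abstraction is that training and serving read features through the same interface, which prevents train/serve skew from hand-rolled feature code.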

How to develop models?

How to train models?

How to serve models?

How to monitor models?

How to iterate on models?

How do I get started with MLOps?



Design the phases of building your ML Platform

Using Vendors

Using Open Source

If you want to avoid vendor lock-in and prefer open source tools:

Obligatory hiring pitch: If you have read this far, you must be interested in MLOps, so come work with me on the ML Platform team at DoorDash! 😄