Uber's Michelangelo vs. Netflix's Metaflow


Michelangelo

Pain Point

Without Michelangelo, each team at Uber that uses ML (that’s all of them - every interaction with the rides or Eats app involves ML) would need to build its own data pipelines, feature stores, training clusters, model storage, etc.  Each team would spend copious amounts of time maintaining and improving its systems, and common patterns/best practices would be hard to spread.  In addition, the highest-priority use cases (business-critical ones, e.g. rider/driver matching) would each need to secure enough compute, storage, and engineering resources to operate through outages and scale peaks, which would result in organizational complexity and constant prioritization battles between managers/directors/etc.


Solution

Michelangelo provides a single platform that makes the most common and most business critical ML use cases simple and intuitive for builders to use, while still allowing self-serve extensibility for all other ML use cases.


It’s built from three main parts:

  1. Control Plane - ML Engineers / Data Scientists / Applied Scientists interact with this layer to do their work.  It’s kind of like the frontend layer of the ML Platform

  2. Offline - Model training, evaluation, tuning/autoML, batch inference, running large scale jobs

  3. Online - live inference, production user interactions
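To make the split concrete, here is a minimal Python sketch of how the three layers might relate.  All class and method names below are hypothetical illustrations, not Michelangelo's actual API:

```python
# Hypothetical sketch of Michelangelo's three-layer split; none of
# these names come from the real system.  It only illustrates how
# the control plane fronts the offline and online planes.

class OfflinePlane:
    """Batch work: training, evaluation, tuning, batch inference."""

    def train(self, dataset: str, params: dict) -> str:
        # A real platform would launch a distributed job here
        # (e.g. on Ray or Spark); we just mint a model id.
        return f"model:{dataset}:v1"

class OnlinePlane:
    """Live inference serving production traffic."""

    def __init__(self):
        self.deployed: set[str] = set()

    def deploy(self, model_id: str) -> None:
        self.deployed.add(model_id)

    def predict(self, model_id: str, features: list[float]) -> float:
        if model_id not in self.deployed:
            raise KeyError(f"{model_id} is not deployed")
        return sum(features)  # stand-in for a real model

class ControlPlane:
    """The 'frontend' layer practitioners actually interact with."""

    def __init__(self):
        self.offline = OfflinePlane()
        self.online = OnlinePlane()

    def train_and_deploy(self, dataset: str, params: dict) -> str:
        model_id = self.offline.train(dataset, params)
        self.online.deploy(model_id)
        return model_id
```

The point of the design is that a practitioner only ever touches the control plane; the offline and online planes stay behind it.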


Most recently they’ve added lots of features to make LLM development easier, such as an integration with Hugging Face that makes open-source LLMs accessible to use and fine-tune, and prompt engineering environments to iterate in.


With Michelangelo, running workloads on Ray/Spark and just getting an ML project off the ground is no longer a heavy lift.  And maintaining an ML project is manageable for product teams.


Metaflow

Pain Point

After prototyping, product teams need to ship their ML projects to production.  Doing so can be very time consuming because of the variety of systems each project needs to integrate with in order to ship to users.


Product teams and ML engineers already have enough technologies they need to stay up-to-date on - adding in all the Netflix production dependencies and integrations required to ship a project to prod is overwhelming and a waste of ML engineer mindshare, when that could be handled and managed for them centrally.


There are a few key types of systems that need to be deployed to:

  • Cached batch-inference-style data/KV APIs

  • GPU-backed live inference APIs
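The first of these, cached batch inference, is a pattern simple enough to sketch in a few lines of stdlib Python (nothing here is Netflix code): predictions are computed offline in bulk, and the online path becomes a plain key-value lookup.

```python
# Toy sketch of the cached batch-inference pattern: score everything
# offline, materialize the results to a KV map, and make the online
# path a constant-time lookup with a fallback for cache misses.

def batch_score(user_ids, model):
    """Offline job: score every user and materialize a KV map."""
    return {uid: model(uid) for uid in user_ids}

class KVInferenceAPI:
    """Online path: O(1) lookup, with a default for unseen keys."""

    def __init__(self, cache, default=0.0):
        self.cache = cache
        self.default = default

    def get(self, user_id):
        return self.cache.get(user_id, self.default)

# Usage with a toy "model" that hashes an id into [0, 1).
model = lambda uid: (hash(uid) % 100) / 100
cache = batch_score(["u1", "u2"], model)
api = KVInferenceAPI(cache)
```

The trade-off versus the GPU-backed live APIs in the second bullet is staleness for latency: batch-scored results can be hours old, but serving them needs no model on the hot path.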


And there are also systems that are not live or user-facing (users here could be internal creatives or actual Netflix subscribers) but that need to be integrated with to allow for engineering progress:

  • workflow/compute orchestration layer

  • Knowledge graph

  • Explainer infra


Solution

Metaflow provides a user-friendly API and integrates with all of the most important systems on the path from ML idea to user-facing product/feature.  It allows for extensions to be written by practitioners, and is an open-source project too.


Metaflow integrates with:

  • (Fast) Data - there’s a software layer on top of the main data lake (S3 Iceberg tables) that finds and pulls the correct data, and another layer (Apache Arrow) that efficiently (zero-copy) converts the data to streamable frames and lets user code process it

  • Compute - e.g. there’s a layer that records metadata about a model and the environment it was trained in, so that explainer models can be built

  • Orchestration - it allows ‘flows’ to be triggered in an event-driven style, so any user code can ‘trigger’ a Metaflow ‘flow’, and any ‘flow’ can trigger another ‘flow’

  • Production - it allows various paths for deploying to production


Compare & Contrast Michelangelo and Metaflow

Metaflow integrates fast data streaming; Michelangelo doesn’t.  Metaflow focuses on common compute/data primitives; Michelangelo goes higher up the stack to common ML tasks like autotuning, evaluation, and a general frontend UX for ML engineers/DS/AS.  Michelangelo solves for resource allocation/prioritization and capacity efficiency via sharing between different teams; Metaflow doesn’t.


Key Differences


Key differentiator
  • Michelangelo: resource sharing, unified UI
  • Metaflow: data streaming, model explainers, event-driven

Architecture
  • Michelangelo: control plane, offline, online
  • Metaflow: storage (S3, etc.), data streaming, event-driven computation DAGs

Fast data / last-mile data processing
  • Michelangelo: ???
  • Metaflow: S3 + Iceberg/Parquet; metaflow.Table to find/load, MetaflowDataFrame/Arrow to stream

Compute
  • Michelangelo: K8s, custom CRD controller, Spark & Ray
  • Metaflow: Titus (K8s), Spark for ETL, dependency management (a user-friendly layer on top of Docker)

Orchestration
  • Michelangelo: cross-team resource sharing
  • Metaflow: Maestro workflows (DAGs; the backbone of the Metaflow project); event-driven architecture

Model hosting
  • Michelangelo: feature transformation graphs bundled with model graphs for deployment (reduces train/serve skew); Gen AI Gateway?
  • Metaflow: Metaflow Hosting (models/artifacts from Metaflow deployed here); autoscaling, ops/observability

Cloud
  • Michelangelo: OCI, GCP
  • Metaflow: probably AWS

Scale
  • Michelangelo: 5K GPUs, 400 projects, 20K training jobs/mo; 5K models in prod, 10M QPS peak
  • Metaflow: ??? GPUs, hundreds of projects

UX
  • Michelangelo: MA Studio (one unified tool), Gen AI Gateway (newer); users submit jobs
  • Metaflow: human-friendly APIs; users define DAGs and trigger them with events

Governance
  • Michelangelo: for models only - auditing, policy guardrails, PII redaction (does data-level governance live elsewhere?)
  • Metaflow: ???


Key Similarities

  • Consolidation of engineering effort for running ML jobs (platform)

  • User-friendliness

    • Michelangelo uses MA Studio UI

    • Metaflow uses a human-friendly API

  • Some overlap in ‘primitives’ offered (compute, data, workflows)


Summary

Uber’s Michelangelo and Netflix’s Metaflow illustrate two viable, yet opposing, theories of ML platforms: unified ML experience vs. pluggable compute/data primitives.  Here’s where they are similar, where they contrast and where they both falter.


A consolidated ML platform is something both systems agree is necessary and valuable.  Engineering effort required to…

  • set up access to compute and data

  • request/provision/allocate compute resources

  • orchestrate workloads

  • allow human-friendly observability of job state / progress / history

…is non-trivial, and having individual teams each do this on their own is wasteful.  Having a platform is no longer a differentiator for an organization's ML teams but a requirement at a certain scale.


But the devil is in the details - each organization is different, and these two systems have evolved to prioritize other common ML features and concepts differently.  Michelangelo excels at higher-level abstractions such as a unified UI and tracking/sharing idle compute resources (GPUs) across teams.  Metaflow delivers more robust distributed-systems primitives, such as its data streaming framework and an event-driven architecture.  Metaflow also provides support for model explainers, a key use case for Netflix.


The one area neither of these systems (or at least the blog posts about them) touches on is data governance.  Michelangelo has features for model governance, but it appears these features were added later and may not pertain to the data used to train those models.  Metaflow has no mention of security, policy or audit trails.  Governance can feel boring to many, but in any organization large enough to have an ML Platform, it’s likely an important topic and it’s a shame the systems don’t go deeper on it.


Still, Michelangelo and Metaflow are excellent examples of ML Platforms at large organizations.


Further Questions

  • Feature store parity – Michelangelo’s Palette is front-and-center; Metaflow leans on Iceberg + Fast Data. Do they solve the same latency/ownership pain, or is one focused on engineering reuse and the other on developer velocity?
