Uber's Michelangelo vs. Netflix's Metaflow


Michelangelo

Pain Point

Without Michelangelo, each team at Uber that uses ML (that’s all of them - every interaction with the rides or Eats app involves ML) would need to build its own data pipelines, feature stores, training clusters, model storage, etc.  Each team would spend copious amounts of time maintaining and improving its systems, and common patterns/best practices would be hard to spread.  In addition, the highest-priority use cases (business-critical ones, e.g. rider/driver matching) would each need to secure enough compute, storage, and engineering resources to operate through outages and scale peaks, which would result in organizational complexity and constant prioritization battles between managers/directors/etc.


Solution

Michelangelo provides a single platform that makes the most common and most business critical ML use cases simple and intuitive for builders to use, while still allowing self-serve extensibility for all other ML use cases.


It’s built from three main parts:

  1. Control Plane - ML Engineers / Data Scientists / Applied Scientists interact with this layer to do their work.  It’s kind of like the frontend layer of the ML Platform

  2. Offline - Model training, evaluation, tuning/autoML, batch inference, running large scale jobs

  3. Online - live inference, production user interactions
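To make the split concrete, here is a minimal Python sketch of how the three layers might relate.  All class and method names below are hypothetical illustrations, not Michelangelo's actual API:

```python
# Hypothetical sketch of Michelangelo's three-layer split; none of
# these names come from the real system.  It only illustrates how
# the control plane fronts the offline and online planes.

class OfflinePlane:
    """Batch work: training, evaluation, tuning, batch inference."""

    def train(self, dataset: str, params: dict) -> str:
        # A real platform would launch a distributed job here
        # (e.g. on Ray or Spark); we just mint a model id.
        return f"model:{dataset}:v1"

class OnlinePlane:
    """Live inference serving production traffic."""

    def __init__(self):
        self.deployed: set[str] = set()

    def deploy(self, model_id: str) -> None:
        self.deployed.add(model_id)

    def predict(self, model_id: str, features: list[float]) -> float:
        if model_id not in self.deployed:
            raise KeyError(f"{model_id} is not deployed")
        return sum(features)  # stand-in for a real model

class ControlPlane:
    """The 'frontend' layer practitioners actually interact with."""

    def __init__(self):
        self.offline = OfflinePlane()
        self.online = OnlinePlane()

    def train_and_deploy(self, dataset: str, params: dict) -> str:
        model_id = self.offline.train(dataset, params)
        self.online.deploy(model_id)
        return model_id
```

The point of the design is that a practitioner only ever touches the control plane; the offline and online planes stay behind it.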


Most recently they’ve added lots of features to make LLM development easier, such as an integration with Hugging Face that makes open-source LLMs accessible to use and fine-tune, and prompt engineering environments to iterate in.


With Michelangelo, running workloads on Ray/Spark and just getting an ML project off the ground is no longer a heavy lift.  And maintaining an ML project is manageable for product teams.


Metaflow

Pain Point

After prototyping, product teams need to ship their ML projects to production.  Doing so can be very time consuming because of the variety of systems each project needs to integrate with in order to ship to users.


Product teams and ML engineers already have enough technologies they need to stay up-to-date on - adding in all the Netflix production dependencies and integrations required to ship a project to prod is overwhelming and a waste of ML engineer mindshare, when that could be handled and managed for them centrally.


There are a few key types of systems that need to be deployed to:

  • Cached batch-inference-style data/KV APIs

  • GPU-backed live inference APIs
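The first of these, cached batch inference, is a pattern simple enough to sketch in a few lines of stdlib Python (nothing here is Netflix code): predictions are computed offline in bulk, and the online path becomes a plain key-value lookup.

```python
# Toy sketch of the cached batch-inference pattern: score everything
# offline, materialize the results to a KV map, and make the online
# path a constant-time lookup with a fallback for cache misses.

def batch_score(user_ids, model):
    """Offline job: score every user and materialize a KV map."""
    return {uid: model(uid) for uid in user_ids}

class KVInferenceAPI:
    """Online path: O(1) lookup, with a default for unseen keys."""

    def __init__(self, cache, default=0.0):
        self.cache = cache
        self.default = default

    def get(self, user_id):
        return self.cache.get(user_id, self.default)

# Usage with a toy "model" that hashes an id into [0, 1).
model = lambda uid: (hash(uid) % 100) / 100
cache = batch_score(["u1", "u2"], model)
api = KVInferenceAPI(cache)
```

The trade-off versus the GPU-backed live APIs in the second bullet is staleness for latency: batch-scored results can be hours old, but serving them needs no model on the hot path.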


And there are also systems that are not live or user-facing (users here could be internal creatives or actual Netflix subscribers) but that need to be integrated with to allow for engineering progress:

  • workflow/compute orchestration layer

  • Knowledge graph

  • Explainer infra


Solution

Metaflow provides a user-friendly API and integrates with all of the most important systems on the path from ML idea to user-facing product/feature.  It allows for extensions to be written by practitioners, and is an open-source project too.


Metaflow integrates with:

  • (Fast) Data - there’s a software layer on top of the main data lake (S3 Iceberg tables) that finds and pulls the correct data, and another layer (Apache Arrow) that efficiently (zero-copy) converts the data to streamable frames and lets user code process it

  • Compute - e.g. there’s a layer that records metadata about a model and the environment it was trained in, so that explainer models can be built

  • Orchestration - it allows ‘flows’ to be triggered in an event-driven style, so any user code can ‘trigger’ a Metaflow ‘flow’, and any ‘flow’ can trigger another ‘flow’

  • Production - it allows various paths for deploying to production


Compare & Contrast Michelangelo and Metaflow

Metaflow integrates fast data streaming; Michelangelo doesn’t.  Metaflow focuses on common compute/data primitives; Michelangelo goes higher up the stack to common ML tasks like autotuning, evaluation, and a general frontend UX for ML engineers/DS/AS.  Michelangelo solves for resource allocation/prioritization and capacity efficiency via sharing between different teams; Metaflow doesn’t.


Key Differences


Key differentiator
  • Michelangelo: resource sharing, unified UI
  • Metaflow: data streaming, model explainers, event-driven

Architecture
  • Michelangelo: control plane, offline, online
  • Metaflow: storage (S3, etc.), data streaming, event-driven computation DAGs

Fast data / last-mile data processing
  • Michelangelo: ???
  • Metaflow: S3 + Iceberg/Parquet; metaflow.Table to find/load, MetaflowDataFrame/Arrow to stream

Compute
  • Michelangelo: K8s, custom CRD controller, Spark & Ray
  • Metaflow: Titus (K8s), Spark for ETL, dependency management (a user-friendly layer on top of Docker)

Orchestration
  • Michelangelo: cross-team resource sharing
  • Metaflow: Maestro workflows (DAGs; the backbone of the Metaflow project); event-driven architecture

Model hosting
  • Michelangelo: feature transformation graphs bundled with model graphs for deployment (reduces train/serve skew); Gen AI Gateway?
  • Metaflow: Metaflow Hosting (models/artifacts from Metaflow deployed here); autoscaling, ops/observability

Cloud
  • Michelangelo: OCI, GCP
  • Metaflow: probably AWS

Scale
  • Michelangelo: 5K GPUs, 400 projects, 20K training jobs/mo; 5K models in prod, 10M QPS peak
  • Metaflow: ??? GPUs, hundreds of projects

UX
  • Michelangelo: MA Studio (one unified tool), Gen AI Gateway (newer); users submit jobs
  • Metaflow: human-friendly APIs; users define DAGs and trigger them with events

Governance
  • Michelangelo: for models only - auditing, policy guardrails, PII redaction (does data-level governance live elsewhere?)
  • Metaflow: ???


Key Similarities

  • Consolidation of engineering effort for running ML jobs (platform)

  • User-friendliness

    • Michelangelo uses MA Studio UI

    • Metaflow uses a human-friendly API

  • Some overlap in ‘primitives’ offered (compute, data, workflows)


Summary

Uber’s Michelangelo and Netflix’s Metaflow illustrate two viable, yet opposing, theories of ML platforms: unified ML experience vs. pluggable compute/data primitives.  Here’s where they are similar, where they contrast and where they both falter.


A consolidated ML platform is something both systems agree is necessary and valuable.  Engineering effort required to…

  • set up access to compute and data

  • request/provision/allocate compute resources

  • orchestrate workloads

  • allow human-friendly observability of job state / progress / history

…is non-trivial, and having individual teams each do this on their own is wasteful.  Having a platform is no longer a differentiator for an organization's ML teams but a requirement at a certain scale.


But the devil is in the details - each organization is different, and these two systems have evolved to prioritize other common ML features and concepts differently.  Michelangelo excels at higher-level abstractions such as a unified UI and tracking/sharing idle compute resources (GPUs) across teams.  Metaflow delivers more robust distributed-systems primitives, such as its data streaming framework and an event-driven architecture.  Metaflow also provides support for model explainers, a key use case for Netflix.


The one area neither of these systems (or at least the blog posts about them) touches on is data governance.  Michelangelo has features for model governance, but it appears these features were added later and may not pertain to the data used to train those models.  Metaflow has no mention of security, policy or audit trails.  Governance can feel boring to many, but in any organization large enough to have an ML Platform, it’s likely an important topic and it’s a shame the systems don’t go deeper on it.


Still, Michelangelo and Metaflow are excellent examples of ML Platforms at large organizations.


Further Questions

  • Feature store parity – Michelangelo’s Palette is front-and-center; Metaflow leans on Iceberg + Fast Data. Do they solve the same latency/ownership pain, or is one focused on engineering reuse and the other on developer velocity?
