Cluster Management at LinkedIn



In 2014 LinkedIn released a cluster management solution called Helix. Helix solves the problems that appear once a system grows too large to manage by hand, even across just a few hosts. A successful system goes through a series of scaling transitions, and once the system is large enough those transitions happen frequently enough to require an automated solution.

First, your system will become too large to host on a single machine. So now you need to shard it.

Then hosts will fail once in a while, or some shards will start taking too much load. So you start using replication to solve for that: extra copies of each shard keep the data available when a host dies and let you spread reads across machines.

As your cluster grows, the average size of your shards grows with it - sometimes you'll have to split shards that have become too big, or redistribute them across hosts more broadly. So now you need something that allows for that, too.

Partitioning/sharding, fault tolerance and scalability - these are the higher-level concepts just described, and the problems Helix solves for. Once you can solve those, there's a good chance your next bottleneck will be TCO (Total Cost of Ownership). With more hosts in use, you'll need more efficient resource utilization - and you can get that with multitenancy. Multitenancy means co-locating multiple tenants on the same host: a tenant that's heavy on CPU but light on memory can share a machine with one that's heavy on memory but light on CPU. That's cheaper than giving those 2 tenants 2 different machines with the same resources. And if you can easily place and redistribute load across hosts, as Helix can, multitenancy comes along with it - so that problem can be solved with Helix as well.
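To make that concrete, here's a minimal sketch of what bootstrapping such a cluster looks like with Helix's Java admin API. The cluster, resource and host names are made up, and exact class locations and bootstrap steps (such as registering state model definitions) can differ between Helix versions - treat it as an illustration of the model, not a copy-paste recipe.

```java
// A minimal sketch (not production code) of bootstrapping a Helix-managed cluster
// through the Helix Java admin API. Names and counts are assumptions; some Helix
// versions also require registering the MasterSlave state model definition first
// via admin.addStateModelDef(...).
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;
import org.apache.helix.model.InstanceConfig;

public class HelixClusterSketch {
  public static void main(String[] args) {
    String zkAddress = "localhost:2181";   // Helix coordinates through ZooKeeper
    String cluster = "DEMO_CLUSTER";       // hypothetical cluster name

    HelixAdmin admin = new ZKHelixAdmin(zkAddress);
    admin.addCluster(cluster);

    // Register a few participant hosts (a real deployment would use real machines).
    for (int port = 12001; port <= 12003; port++) {
      InstanceConfig node = new InstanceConfig("localhost_" + port);
      node.setHostName("localhost");
      node.setPort(String.valueOf(port));
      admin.addInstance(cluster, node);
    }

    // Sharding + replication: a resource split into 8 partitions using the
    // MasterSlave state model, in FULL_AUTO mode so Helix itself computes and
    // maintains the partition-to-host mapping as hosts come and go.
    admin.addResource(cluster, "userStore", 8, "MasterSlave",
        IdealState.RebalanceMode.FULL_AUTO.name());
    admin.rebalance(cluster, "userStore", 2);   // 2 replicas per partition

    // Multitenancy: a second, independent resource placed on the same hosts, so
    // tenants with complementary CPU/memory profiles can share machines.
    admin.addResource(cluster, "searchIndex", 4, "MasterSlave",
        IdealState.RebalanceMode.FULL_AUTO.name());
    admin.rebalance(cluster, "searchIndex", 2);
  }
}
```

The admin calls only describe the desired layout; each participant host would additionally run a HelixManager with a state-model implementation so the Helix controller can drive its replicas through the offline/slave/master transitions, and in FULL_AUTO mode the controller recomputes placement whenever hosts join or leave.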

Although Helix does a lot, there are problems it doesn't solve. If load on your system fluctuates such that different shards and replicas heat up or cool down at different times of day, you'll want those replicas to move around to different hosts so that load stays well distributed. But to get that, you'd need a system that monitors load metrics on each server in real time and can move shards around on the fly as load shifts across the cluster. That means dynamic load balancing, and as of today, Helix does not support it.
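To illustrate what that would take, here's a purely hypothetical sketch of the control loop such a feature needs. None of these types (LoadMonitor, ShardMover, DynamicLoadBalancer) are Helix APIs; they just spell out the "monitor load in real time, move shards when the skew gets too big" idea.

```java
import java.util.Map;

// Hypothetical sketch of a dynamic load balancer - this is NOT a Helix API,
// just an illustration of the control loop Helix currently leaves to its users.
class DynamicLoadBalancer {
  interface LoadMonitor {                       // reports per-host load in real time
    Map<String, Double> loadByHost();
  }
  interface ShardMover {                        // relocates one shard between hosts
    void moveOneShard(String fromHost, String toHost);
  }

  private final LoadMonitor monitor;
  private final ShardMover mover;
  private final double skewThreshold;           // e.g. 1.5 => hottest host > 1.5x coldest

  DynamicLoadBalancer(LoadMonitor monitor, ShardMover mover, double skewThreshold) {
    this.monitor = monitor;
    this.mover = mover;
    this.skewThreshold = skewThreshold;
  }

  // One pass of the loop: if the hottest host is far busier than the coldest,
  // shift a shard from the former to the latter. Run this on a timer.
  void rebalanceOnce() {
    Map<String, Double> load = monitor.loadByHost();
    if (load.size() < 2) {
      return;                                   // nothing to balance
    }
    String hottest = null;
    String coldest = null;
    for (Map.Entry<String, Double> e : load.entrySet()) {
      if (hottest == null || e.getValue() > load.get(hottest)) hottest = e.getKey();
      if (coldest == null || e.getValue() < load.get(coldest)) coldest = e.getKey();
    }
    if (load.get(hottest) > skewThreshold * load.get(coldest)) {
      mover.moveOneShard(hottest, coldest);     // e.g. by rewriting the placement for one shard
    }
  }
}
```

In practice the move step could be layered on top of Helix by rewriting a resource's preferred placement, but the real-time monitoring and the decision to move would have to live outside Helix.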

Despite its shortcomings, for a system built 6+ years ago, Helix has stood the test of time well. Plus it's open source, so it gets bonus points for contributing concrete value to the world of software engineering.
