Cluster Management at LinkedIn



In 2014 LinkedIn released a cluster management solution called Helix. Helix solves the problems that appear once a system grows too large to manage by hand, even across just a few hosts. A successful system goes through a series of scaling transitions, and once the system is large enough those transitions happen frequently enough to require an automated solution.

First, your system will become too large to host on a single machine. So now you need to shard it.

Then hosts will fail once in a while, or some shards will start taking too much load. So you start using replication to solve for that: extra copies of each shard keep the data available when a host dies and let you spread reads across machines.

As your cluster grows, the average size of your shards grows with it - sometimes you'll have to split shards that have become too big, or redistribute them across hosts more broadly. So now you need something that allows for that, too.

Partitioning/sharding, fault tolerance and scalability - these are the higher-level concepts just described, and the problems Helix solves for. Once you can solve those, there's a good chance your next bottleneck will be TCO (Total Cost of Ownership). With more hosts in use, you'll need more efficient resource utilization - and you can get that with multitenancy. Multitenancy means co-locating multiple tenants on the same host: a tenant that's heavy on CPU but light on memory can share a machine with one that's heavy on memory but light on CPU. That's cheaper than giving those 2 tenants 2 different machines with the same resources. And if you can easily place and redistribute load across hosts, as Helix can, multitenancy comes along with it - so that problem can be solved with Helix as well.
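To make that concrete, here's a minimal sketch of what bootstrapping such a cluster looks like with Helix's Java admin API. The cluster, resource and host names are made up, and exact class locations and bootstrap steps (such as registering state model definitions) can differ between Helix versions - treat it as an illustration of the model, not a copy-paste recipe.

```java
// A minimal sketch (not production code) of bootstrapping a Helix-managed cluster
// through the Helix Java admin API. Names and counts are assumptions; some Helix
// versions also require registering the MasterSlave state model definition first
// via admin.addStateModelDef(...).
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;
import org.apache.helix.model.InstanceConfig;

public class HelixClusterSketch {
  public static void main(String[] args) {
    String zkAddress = "localhost:2181";   // Helix coordinates through ZooKeeper
    String cluster = "DEMO_CLUSTER";       // hypothetical cluster name

    HelixAdmin admin = new ZKHelixAdmin(zkAddress);
    admin.addCluster(cluster);

    // Register a few participant hosts (a real deployment would use real machines).
    for (int port = 12001; port <= 12003; port++) {
      InstanceConfig node = new InstanceConfig("localhost_" + port);
      node.setHostName("localhost");
      node.setPort(String.valueOf(port));
      admin.addInstance(cluster, node);
    }

    // Sharding + replication: a resource split into 8 partitions using the
    // MasterSlave state model, in FULL_AUTO mode so Helix itself computes and
    // maintains the partition-to-host mapping as hosts come and go.
    admin.addResource(cluster, "userStore", 8, "MasterSlave",
        IdealState.RebalanceMode.FULL_AUTO.name());
    admin.rebalance(cluster, "userStore", 2);   // 2 replicas per partition

    // Multitenancy: a second, independent resource placed on the same hosts, so
    // tenants with complementary CPU/memory profiles can share machines.
    admin.addResource(cluster, "searchIndex", 4, "MasterSlave",
        IdealState.RebalanceMode.FULL_AUTO.name());
    admin.rebalance(cluster, "searchIndex", 2);
  }
}
```

The admin calls only describe the desired layout; each participant host would additionally run a HelixManager with a state-model implementation so the Helix controller can drive its replicas through the offline/slave/master transitions, and in FULL_AUTO mode the controller recomputes placement whenever hosts join or leave.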

Although Helix does a lot, there are problems it doesn't solve. If load on your system fluctuates such that different shards and replicas heat up or cool down at different times of day, you'll want those replicas to move around to different hosts so that load stays well distributed. But to get that, you'd need a system that monitors load metrics on each server in real time and can move shards around on the fly as load shifts across the cluster. That means dynamic load balancing, and as of today, Helix does not support it.
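To illustrate what that would take, here's a purely hypothetical sketch of the control loop such a feature needs. None of these types (LoadMonitor, ShardMover, DynamicLoadBalancer) are Helix APIs; they just spell out the "monitor load in real time, move shards when the skew gets too big" idea.

```java
import java.util.Map;

// Hypothetical sketch of a dynamic load balancer - this is NOT a Helix API,
// just an illustration of the control loop Helix currently leaves to its users.
class DynamicLoadBalancer {
  interface LoadMonitor {                       // reports per-host load in real time
    Map<String, Double> loadByHost();
  }
  interface ShardMover {                        // relocates one shard between hosts
    void moveOneShard(String fromHost, String toHost);
  }

  private final LoadMonitor monitor;
  private final ShardMover mover;
  private final double skewThreshold;           // e.g. 1.5 => hottest host > 1.5x coldest

  DynamicLoadBalancer(LoadMonitor monitor, ShardMover mover, double skewThreshold) {
    this.monitor = monitor;
    this.mover = mover;
    this.skewThreshold = skewThreshold;
  }

  // One pass of the loop: if the hottest host is far busier than the coldest,
  // shift a shard from the former to the latter. Run this on a timer.
  void rebalanceOnce() {
    Map<String, Double> load = monitor.loadByHost();
    if (load.size() < 2) {
      return;                                   // nothing to balance
    }
    String hottest = null;
    String coldest = null;
    for (Map.Entry<String, Double> e : load.entrySet()) {
      if (hottest == null || e.getValue() > load.get(hottest)) hottest = e.getKey();
      if (coldest == null || e.getValue() < load.get(coldest)) coldest = e.getKey();
    }
    if (load.get(hottest) > skewThreshold * load.get(coldest)) {
      mover.moveOneShard(hottest, coldest);     // e.g. by rewriting the placement for one shard
    }
  }
}
```

In practice the move step could be layered on top of Helix by rewriting a resource's preferred placement, but the real-time monitoring and the decision to move would have to live outside Helix.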

Despite its shortcomings, for a system built 6+ years ago, Helix has stood the test of time well. Plus it's open source, so it gets bonus points for contributing concrete value to the world of software engineering.
