Posts

Uber's Michelangelo vs. Netflix's Metaflow

Michelangelo. Pain point: Without Michelangelo, each team at Uber that uses ML (that's all of them - every interaction with the ride or Eats app involves ML) would need to build its own data pipelines, feature stores, training clusters, model storage, etc.  It would take each team copious amounts of time to maintain and improve their systems, and common patterns/best practices would be hard to learn.  In addition, the highest-priority use cases (business critical, e.g. rider/driver matching) would themselves need to ensure they have enough compute/storage/engineering resources to operate through outages, scale peaks, etc., which would result in organizational complexity and constant prioritization battles between managers/directors/etc. Solution: Michelangelo provides a single platform that makes the most common and most business-critical ML use cases simple and intuitive for builders to use, while still allowing self-serve extensibi...

Cluster Management at LinkedIn

In 2014 LinkedIn released a cluster management solution called Helix. Helix solves the problems that arise once a system scales beyond what can be managed by hand on just a few hosts. A successful system goes through a few transitions that, at sufficient scale, become frequent enough to require an automated solution. First, your system becomes too large to host on a single machine, so you need to shard it. Then hosts start failing once in a while, or some shards get too big or take too much load, so you start using replication to solve for that. As your cluster grows, the average size of your shards grows too - sometimes you'll have to split shards because they've become too big, or redistribute them more broadly, so you need something to allow for that. Partitioning/sharding, fault tolerance and scalability - these are the higher-level concepts just described, and the problems Helix solves for. If you can sol...
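To make those three concepts concrete, here's a minimal Python sketch - very much not Helix's actual API; the host names, shard count and replication factor are made up - of hash-based partitioning with replicas spread across hosts:

    # Toy sketch of partitioning + replication. The hard part Helix handles is
    # everything this script does NOT do: noticing host failures, splitting
    # shards, and moving data to match a new mapping without downtime.
    import hashlib

    HOSTS = ["host-a", "host-b", "host-c", "host-d"]  # hypothetical cluster
    NUM_SHARDS = 8
    REPLICATION_FACTOR = 2

    def shard_for(key: str) -> int:
        """Partitioning: deterministically map a key to one of NUM_SHARDS shards."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    def replicas_for(shard: int) -> list[str]:
        """Fault tolerance: each shard lives on REPLICATION_FACTOR distinct hosts."""
        return [HOSTS[(shard + i) % len(HOSTS)] for i in range(REPLICATION_FACTOR)]

    shard = shard_for("member:42")
    print(shard, replicas_for(shard))

The scalability piece - recomputing this mapping when hosts die or shards split, and moving data accordingly - is exactly the automated coordination a system like Helix is for.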

The Curious Case of the Document Database

Let's talk about an oft-overlooked NoSQL database type. It's got all the best parts of a k-v store and allows limited SQL-like query-ability too! They're called document databases and they're all the rage. A document database stores objects that can be serialized to JSON or some similar serialization format. These JSON 'documents' are keyed by an ID, similar to how k-v stores work. When you want to fetch an entire document, all you need is its key. But the magic of document databases is that they also let you fetch only pieces of a document, and fetch data from multiple documents using selection criteria that mirror basic SQL query functionality. This is all made possible by the tree-like structure that documents in a doc DB must conform to. JSON data can contain keyed fields, nested structures, and lists. Using this structure, a doc DB can extract specific pieces of a doc so that the entire doc doesn't have to be returned and parsed in the application layer. Query-ability...
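As a rough illustration of both tricks - pulling one nested field out of a document, and selecting documents by a field value - here's a tiny self-contained Python sketch. It's illustrative only, not the query API of any particular document database:

    # Two toy doc-DB operations over in-memory JSON-like documents.
    documents = {
        "user:1": {"name": "Ada", "address": {"city": "London"}, "tags": ["admin"]},
        "user:2": {"name": "Lin", "address": {"city": "Austin"}, "tags": []},
    }

    def get_path(doc: dict, path: str):
        """Walk a dotted path (e.g. 'address.city') down the document's tree."""
        node = doc
        for part in path.split("."):
            node = node[part]
        return node

    def select(field: str, value):
        """Return the IDs of matching documents - a very rough WHERE clause."""
        return [doc_id for doc_id, doc in documents.items() if get_path(doc, field) == value]

    print(get_path(documents["user:1"], "address.city"))  # -> London
    print(select("address.city", "Austin"))               # -> ['user:2']

A real document database does this server-side, so only the requested slice of the document crosses the wire instead of the whole blob.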

Consistency in Redis

Most uses of Redis focus more on latency and availability than on consistency - that's because at its core, Redis is essentially a cache. Generally speaking, you store things in Redis in memory and you update or read them extremely quickly. You need to make sure the cache is always available, so in most cases you'd only choose Redis if you're leaning towards an A-class system (A for Availability) rather than a C-class system (Consistency). However, it's important to know that a replicated Redis deployment can give you different levels of consistency, up to and including read-after-write consistency - the kind of consistency that guarantees any read that happens after a successful response to a write will see that write, even if the read goes to a different replica than the write did. What Redis can't give you is linearizability - the guarantee that any set of observers of the system will only be able to see a single copy of the syste...
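For the curious, here's roughly what trading a little latency for a stronger write guarantee looks like with Redis's WAIT command via redis-py. This assumes a primary on localhost:6379 with at least one attached replica - adjust the host, port and replica count for your own setup:

    # Sketch: block until a write has been acknowledged by a replica.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.set("driver:42:status", "available")

    # Wait until at least 1 replica has acknowledged the preceding write,
    # or 500 ms pass. This narrows the window where a replica read could
    # miss the write, but it is still not linearizability - a failover can
    # still lose acknowledged writes.
    acked = r.wait(1, 500)
    print(f"write replicated to {acked} replica(s)")

That gap between "my replicas saw my write" and "everyone sees one copy of the data" is exactly the distinction the post is drawing.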

CAP Theorem Explained

When building large-scale software systems today, you have to make tradeoffs.  You can't have an ACID-compliant data store with infinite storage/throughput/connections that's always available in any part of the world with super low latency, where clients can read/write concurrently without any risk of inconsistency, and that's free.  If you could, the problem would be solved and our industry could go build spaceships at SpaceX or retire and make sourdough every Sunday. Instead, we need to make tradeoffs.  Does our product/system need ACID semantics?  Is latency more important?  Can we allow certain types of data inconsistencies for a short time in favor of availability?  How much are we able to spend so that we don't have to sacrifice as much? These are some of the questions that everyone building a large-scale software system has to grapple with in the design phase.  A great way to begin your thinking is with CAP Theorem - or at least what it's slowly be...

XFN Development - What's it all About?

XFN (cross-functional) work is one of the challenges of being a senior engineer at most tech companies.  Broadly, it means interacting with members of teams other than your own.  Concretely, this can mean anything from aligning goals or gathering feedback from other teams to inform your roadmap, to pair programming to flesh out the design of a new interface.  Your team wants you to make progress on its goals through XFN, but also to make the team look good (i.e. competent, smart, motivated) to those who are judging.  In my organization, this type of work is reserved for more seasoned engineers - although you're interacting with others on system designs, a lot of it is not the type of stuff taught in CS classes.  It's about personal interactions. When you first start talking to a member of a different team, it's important to ensure they feel that you're someone they want to work with.  You should be able to describe the system you own or are buildi...

Redis!...Huh? What ISN'T it good for?

Redis is an in-memory key-value data store that lets you store your actual data structures rather than having a mapping layer between your application and your storage.  Support exists for any data type you'd need, including lists, sets and hashes/maps. It's in-memory but also has options to push to disk - you can push to disk on every write at a huge performance cost, or at some regular interval.  Writes can be configured to happen via an append-only log, which makes them lightning fast. Pushing to disk every 1 second has comparable performance to never pushing to disk at all. Redis supports replication in a few different ways.  By default it's asynchronous, but it can be configured to be synchronous for safety.  Combined with append-only logging on every write, you can have 100% consistency of your data on any successful write. Redis Cluster allows automatic sharding and handling of many different failure scenarios, so if a small number of the hosts in your c...
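Here's a quick sketch of what "storing your actual data structures" looks like with redis-py. It assumes a Redis server on localhost:6379, and the key names are just examples; persistence options such as the append-only log (appendonly / appendfsync in redis.conf) are configured on the server side, not in this client code:

    # Lists, sets and hashes stored directly in Redis, no mapping layer.
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    r.rpush("recent:searches", "pizza", "sushi")       # list
    r.sadd("user:42:favorites", "store:7", "store:9")  # set
    r.hset("user:42", mapping={"name": "Ada", "plan": "premium"})  # hash/map

    print(r.lrange("recent:searches", 0, -1))  # ['pizza', 'sushi']
    print(r.smembers("user:42:favorites"))     # {'store:7', 'store:9'}
    print(r.hgetall("user:42"))                # {'name': 'Ada', 'plan': 'premium'}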

Don't NOT Repeat Yourself!

Sometimes duplicate code is good!! ...what?  What do you mean?  You're 'on to me'? .... Ok ok ok hold on, just hear me out! Picture it: you're writing unit tests.  Headphones on, hoody blowing in the cold wind from your AC unit in your dark apartment. You've written all your tests, they're passing and you're feeling great.  You refactor.  The tests have a lot of duplicate setup code, so having the DRY sense of humor you've got, you mop up.  Get it all lookin' fine and tidy.  Common methods for all the setup, some parameters to handle the different configurations of the unit tests. Freshhhhhhhhhhhh :D You push your code to the test env - it breaks - OH SHEEIT.  You missed a few edge cases. No prob, no prob! Just add a few unit tests, slip in an if-else here and there in your production code, and you can get your changes in and make it to the office before the 2pm happy hour! Nice - easy peasy. But WAIT.  The edge cases you missed ...
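To picture the refactor being described, here's a hypothetical version of that "mopped up" test file: duplicated per-test setup pulled into one shared, parameterized helper. The post's actual code isn't shown, so the fixture and function names here are made up:

    # Shared, parameterized setup for unit tests (Python's unittest).
    # The catch the post is building toward: every new edge case now has to
    # squeeze through this one helper's parameters.
    import unittest

    def make_order(items=None, discount=0.0, express=False):
        """Shared setup: builds the fixture every test reuses."""
        return {"items": items or ["widget"], "discount": discount, "express": express}

    def order_total(order, price_per_item=10.0):
        """Toy function under test."""
        subtotal = len(order["items"]) * price_per_item
        return subtotal * (1 - order["discount"]) + (5.0 if order["express"] else 0.0)

    class OrderTotalTest(unittest.TestCase):
        def test_basic_order(self):
            self.assertEqual(order_total(make_order()), 10.0)

        def test_discounted_express_order(self):
            order = make_order(items=["a", "b"], discount=0.5, express=True)
            self.assertEqual(order_total(order), 15.0)

    if __name__ == "__main__":
        unittest.main()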