The Best Intro to Distributed Systems

I recently switched teams at Facebook. I went from working on user-facing search technologies to running the distributed systems that allow other teams at my company to move fast. To put it another way, I used to bake the bread, now I build the bakers an oven! Although I had some background in these systems, things change quickly, and the field is deep, with decades of knowledge to pull from and build upon. I needed a refresher!

I found one - a friend recommended a book called 'Designing Data-Intensive Applications' (DDIA). It sounded a bit off-topic - the title didn't have 'distributed' or 'systems' in it! Turns out it gave me exactly what I needed. DDIA is the best introduction to, and deep dive into, all the areas of expertise that the distributed systems (DS) specialty has to offer - and wow is this field deep! The book requires a bit of background - you'll need a bachelor's in computer science or similar experience, as the first few chapters build on that knowledge. Once the foundation is set, the book goes into very important concepts that anyone working in a software infrastructure environment like AWS, Google Cloud or Facebook infra needs to know. The foundational chapters cover reliability, maintainability and scalability, and then go on to talk about how to think about and handle data - for example, how do you decide which encoding format to use when starting a software project?

Then, in Part II, the book goes into the real meaty parts of distributed systems - the things that are absolutely essential to know if you're going to go on call for AWS or for a team that owns and maintains an instance of ZooKeeper. Replication, transactions and the fundamental problems that the field of distributed systems is trying to solve - these are a few of the topics that Part II covers.

Last comes the most fascinating part of the book - it's really the reward for getting through all of the previous material - covering the newest and most interesting concepts that the field has to offer at this time. Part III does a deep comparison between batch and stream processing and makes the argument that stream processing is the Next Big Thing in software architectures. If batch processing was one of the most important developments at Google's inception, then stream processing will be the most important development in the inception of... well, whatever the next Google is!
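
If the batch/stream distinction is new to you, here's a rough feel for it in Python. This is a toy sketch, not code from the book: the function names and the event data are made up for illustration. Batch processing sees a complete, bounded input before it starts; stream processing keeps running state and updates its answer as each event arrives.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(events: list[str]) -> Counter:
    """Batch: the whole bounded input is available before processing starts."""
    return Counter(events)


def stream_count(events: Iterable[str]) -> Iterator[Counter]:
    """Stream: update running state per event and emit a snapshot each time."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        # Yield a copy so earlier snapshots are not mutated by later events.
        yield counts.copy()


# Hypothetical event log, used only for this illustration.
events = ["click", "view", "click"]

batch_result = batch_count(events)
stream_results = list(stream_count(iter(events)))
```

Both approaches end up with the same totals once the input is exhausted; the difference is that the streaming version has a usable (partial) answer after every single event, which is what matters when the input never really ends.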

After finishing the book and taking a few notes and flashcards (which I'll post somewhere for others to use when I get around to it), I was much more confident going into meetings and discussions about how our systems work, how they're currently broken and how they could be better. The book was a watershed moment in my career, and I'd encourage anyone in a software engineering role on a DS-focused team to read it. This one needs to go to the top of your 'software career-ing' reading list.

And please do send me a note if you end up reading it (or have already) and tell me your thoughts!
