Posts

Showing posts from February, 2020

Uber's Michelangelo vs. Netflix's Metaflow

  Uber's Michelangelo vs. Netflix's Metaflow Michelangelo Pain point Without michelangelo, each team at uber that uses ML (that’s all of them - every interaction with the ride or eats app involves ML) would need to build their own data pipelines, feature stores, training clusters, model storage, etc.  It would take each team copious amounts of time to maintain and improve their systems, and common patterns/best practices would be hard to learn.  In addition, the highest priority use cases (business critical, e.g. rider/driver matching) would themselves need to ensure they have enough compute/storage/engineering resources to operate (outages, scale peaks, etc.), which would results in organizational complexity and constant prioritization battles between managers/directors/etc. Solution Michelangelo provides a single platform that makes the most common and most business critical ML use cases simple and intuitive for builders to use, while still allowing self-serve extensibi...

Why are Distributed Systems So Hard?

Image
Isn't it true you can just write deterministic code and if you do it right and work to fix all the bugs, eventually you'll have a simple that never does the wrong thing?  If that's not the case then why not?  Computers are deterministic - they're predictable and they only ever do exactly what you ask them to do and nothing more...right? It's complicated. On a single node (computer) most failures mean the entire node completely stops working - for example the power supply fails, the disk dies, or the motherboard gets fried.  These types of failures are easy to detect because the node won't be in a 'partially failed' state where sometimes it performs the functions it's asked and sometimes it doesn't. In a network, you have multiple nodes - possibly hundreds or thousands.  When one node fails, it is impossible to be 100% sure that node is completely down.  And if you can't be sure that a node won't come back to life, then you can...