
How to Frame Metric Collection




Depending on the type of software development you’re doing, it can be tough to figure out what metrics you need to collect. An iterative process (a fancy term for trial and error) will eventually get you where you need to be, but along the way your mean time to recovery (MTTR) for outages will suffer and you might lose users or revenue.

There’s a simple way to think about metrics that will help you build an intuition for what to monitor and what to measure. It’s this:

Measure the business, measure the software. But don’t conflate the two.

Measuring the business is critical to being able to notify and escalate to the proper personnel when there is an outage. Business metrics include things that impact your bottom line - user sign-ins per minute, items added to the cart per second, items sold per day. Anything that directly and immediately affects customers is a business metric.
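As a rough illustration of a business metric, here is a minimal sketch of a trailing-window counter for something like sign-ins per minute. The `RateCounter` class and the `signins` name are hypothetical, made up for this example. Timestamps are passed in explicitly (and assumed to arrive in order) so the behavior is easy to follow. A real system would more likely publish to a metrics backend than keep counts in memory.

```python
from collections import deque


class RateCounter:
    """Hypothetical helper: counts events inside a trailing time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()  # event times in seconds, oldest first

    def record(self, now: float) -> None:
        """Record one event (for example, one user sign-in) at time `now`."""
        self._timestamps.append(now)

    def count(self, now: float) -> int:
        """Return how many recorded events fall inside the window ending at `now`."""
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Business metric: user sign-ins per minute.
signins = RateCounter(window_seconds=60.0)
for t in (0.0, 10.0, 45.0, 70.0):
    signins.record(t)

signins_last_minute = signins.count(now=75.0)
```

A sudden drop in a number like `signins_last_minute` is the kind of signal that tells you who to notify, regardless of what the servers themselves are reporting.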

Software metrics are signals about how your software is running. There are three categories of software metrics: OS metrics, generic server metrics and application metrics. OS metrics are things that can be measured at the OS level without knowing anything about the process(es) running on top of the OS - CPU, memory, network connections, etc. These let you tune the software’s performance and are necessary for debugging the hardest issues. Generic server metrics are things you’d be able to collect from any web server, application container, DB, message queue, cache server, etc. - things like web requests per second, DB transactions per second or cache hit ratio. Lastly, application metrics are things you can collect that are specific to your application: whatever you want to publish to tell you its state and/or current operations. For example, you might measure how long one important function takes to run per request.
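To make the last two categories concrete, here is a small sketch using only the Python standard library. It assumes a hypothetical in-memory `timings` registry, a hypothetical `timed` decorator, and an invented `price_quote` function standing in for "one important function". None of these names come from a real metrics library. The cache hit ratio is computed from counters you would normally read from the cache server itself.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory registry: metric name -> list of durations in seconds.
timings = defaultdict(list)


def timed(metric_name):
    """Application metric: record how long the wrapped function takes per call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the function raises.
                timings[metric_name].append(time.perf_counter() - start)
        return wrapper
    return decorator


def cache_hit_ratio(hits, misses):
    """Generic server metric: fraction of cache lookups that were hits."""
    total = hits + misses
    return hits / total if total else 0.0


@timed("app.price_quote.seconds")
def price_quote(distance_km):
    """Invented stand-in for an important per-request function."""
    return round(2.50 + 1.20 * distance_km, 2)


quote = price_quote(10)
ratio = cache_hit_ratio(hits=90, misses=10)
```

Prefixing metric names (here `app.`) is one simple way to keep software metrics visibly separate from business metrics, so the two are never conflated on a dashboard.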

This is a good place to start when creating a holistic set of metrics to monitor your system, and if you start here, you’ll have a good chance of getting all the details right as you go.
