

(metamorworks/Shutterstock)
Data scientists want to do data science. It’s right there in the title, after all. But data scientists are often asked to do things besides building machine learning models, such as creating data pipelines and provisioning compute resources for ML training. That’s not a good way to keep data scientists happy.
To keep Netflix’s 300 data scientists happy and productive, Savin Goyal, the leader of Netflix’s machine learning infrastructure team, began developing a new framework in 2017 that abstracts away some of these less data science-y activities and lets data scientists focus more of their time on data science. The framework, called Metaflow, was released by Netflix as an open source project in 2019 and has been widely adopted since.
Goyal recently sat down with Datanami to talk about why he created Metaflow, what it does, and what customers can expect from an enterprise version of Metaflow at his startup, Outerbounds.
“If you want to do machine learning, there are a few raw ingredients that are needed,” Goyal says. “You need some place to store and manage your data, some way to access that data. You need some place to orchestrate compute for your machine learning models. Then there are all the concerns around MLops–versioning, experiment tracking, model deployment and whatnot. And then you need to figure out how is the data scientist actually authoring their work.
“So all of these raw building blocks in many ways exist, but then it’s still sort of left up to a data scientist to cross the barriers from one tool to the other,” he continues. “So that’s where Metaflow comes in.”
Metaflow helps by standardizing many of these processes and tasks, thereby allowing a data scientist to focus more on machine learning activities using Python or R, or any framework. In other words, it makes the data scientist “full stack,” Goyal says.
“It’s really important for a company like Netflix to provide a common platform for all its data scientists to be a lot more productive,” he says. “Because one of the big issues that bites most enterprises is that, if you have to deliver value with machine learning, not only do you need people who are great with data science, but you also need to enable folks to navigate internal system complexity.”
With Metaflow in place, Netflix’s data scientists (who call themselves machine learning engineers) don’t have to worry about how to connect to various internal data sources, or how to get access to large compute instances, Goyal says. Metaflow automates many aspects of running training or inference pipelines at scale on Netflix’s cloud platform, which is AWS.
In addition to automating access to data and compute, Metaflow provides MLOps capabilities that help data scientists document their work through code snapshots and other mechanisms. According to Goyal, the ability to reproduce results is one of the big benefits provided by the framework.
“One of the things that machine learning has traditionally lacked is reproducibility,” he says. “Let’s say you’re a data scientist and you’re training a model. Oftentimes, nobody else would be able to reproduce your results. Or if you’re running into an issue, nobody would be able to reproduce that issue to debug that. So we basically provide guarantees around reproducibility that then enables people to share results within teams, and that fosters collaboration.”
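That reproducibility comes from Metaflow versioning the code and data artifacts of every run and exposing them through a client API, so a teammate can pull up the exact results behind an experiment. The sketch below is illustrative only; the flow name TrainingFlow and the artifact names model and accuracy are hypothetical stand-ins for whatever a team actually stores.

```python
from metaflow import Flow

# Fetch the most recent successful run of a (hypothetical) flow named
# TrainingFlow -- every run is versioned, so a teammate can pull up the
# exact code snapshot and artifacts behind a reported result.
run = Flow("TrainingFlow").latest_successful_run

# Artifacts assigned to `self` inside the flow's steps (here assumed to be
# `self.model` and `self.accuracy`) are retrievable by anyone with access.
print(run.id, run.finished_at)
print("reported accuracy:", run.data.accuracy)
model = run.data.model  # the trained model object, ready to inspect or reuse
```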
Metaflow also lets users mix and match different cloud instance types within a given ML workflow, which helps them reduce costs, Goyal says. For example, say a data scientist wants to train an ML model using a large data set housed in Snowflake. Since it’s a lot of data, it needs to go through a memory-intensive analysis process first, he says. Then the data scientist may want to train models on GPUs. Following the compute-intensive training process, deploying the model for inference requires fewer resources.
“You can basically carve out different sets of your workflow to run on different instance types and different resources,” Goyal says. “That further lowers your overall cost of training a machine learning model. You do not want to pay for a GPU instance while you’re only doing something that’s memory-intensive and doesn’t really engage the GPU.”
Metaflow allows data scientists to use their development tool or framework of choice. They can continue to run TensorFlow, PyTorch, scikit-learn, XGBoost, or any other ML framework they want. While there is a GUI for Metaflow, the primary method of interaction with the product is by including decorators in their Python or R code. At runtime, the decorators determine how the code will execute, Goyal says.
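As a rough illustration of that decorator-driven style, and of the per-step resource mixing described above, the sketch below defines a flow whose data-preparation step requests a memory-heavy instance while the training step requests a GPU. The step logic and the resource numbers are placeholder assumptions, not a prescription.

```python
from metaflow import FlowSpec, step, resources

class TrainingFlow(FlowSpec):
    """Illustrative flow: each step can request different resources."""

    @step
    def start(self):
        # In a real flow this might pull a large dataset, e.g. from Snowflake.
        self.rows = list(range(1_000_000))  # placeholder data
        self.next(self.prepare)

    @resources(memory=64000, cpu=16)   # memory-heavy prep step, no GPU needed
    @step
    def prepare(self):
        self.features = [r * 2 for r in self.rows]  # stand-in feature engineering
        self.next(self.train)

    @resources(gpu=1, memory=16000)    # request a GPU only where training needs it
    @step
    def train(self):
        self.model = sum(self.features)  # stand-in for fitting a real model
        self.accuracy = 0.9              # stand-in metric
        self.next(self.end)

    @step
    def end(self):
        print("done; accuracy =", self.accuracy)

if __name__ == "__main__":
    TrainingFlow()
```

Run locally with `python training_flow.py run`, or hand the same steps to a cloud backend (for example, `run --with batch` on AWS) so each step lands on an instance matching its decorator, which is how the GPU gets engaged only for the step that actually uses it.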
“We are basically targeting people who know data science,” he says. “They don’t want to be taught data science. They’re not looking for a no-code, low-code solution. They’re looking for a solution that firmly puts them in control while abstracting away all infrastructure concerns. That’s where Metaflow comes in.”
Since Netflix initially released Metaflow to open source in 2019, it’s been adopted by hundreds of companies. According to the project’s website, it’s being used by Goldman Sachs, Autodesk, Amazon, S&P Global, Dyson, Intel, Zillow, Merck, Warner Media, and DraftKings. CNN, another user, reports seeing an 8x improvement in the number of models it puts into production over time, Goyal says.
The open source project on GitHub has 7,000 stars, putting it in the top one or two projects in this space, he says. The Slack channel is quite busy, with about 3,000 active members. Initially released for AWS, Metaflow has since been adapted to work on Microsoft Azure and Google Cloud, as well as Kubernetes, Goyal says. It also has been used with hosted clouds from Oracle and Dell.
In 2021, Goyal co-founded Outerbounds with Ville Tuulos, a former Netflix colleague, and Oleg Avdeev, who hailed from MLOps vendor Tecton. Goyal and his team at the San Francisco-based company continue to be the primary developers of the open source Metaflow project. Four months ago, Outerbounds launched a hosted version of Metaflow that allows users to get up and running very quickly on AWS.
Because Outerbounds controls how the infrastructure is deployed, it can offer stronger guarantees around security, performance, and fault tolerance with its managed offering than it can with the open-source version, Goyal says.
“In open source, we have to make sure that our offering works for every single user out there who wants to use us. In certain specific areas, that’s sort of easier said than done,” he says. “If you’re using our managed offering then we can afford to take certain very specific opinions” about the deployment.
Reducing cloud spending is a big focus for the Outerbounds offering, particularly with the scarcity and expense of GPUs these days. Eventually, the company plans to enable customers to tap into the power of GPUs residing on-prem, provided there is some sort of connection to the hyperscaler of choice.
“As you start scaling machine learning models, things start getting expensive quite quickly and there are a lot of mechanisms to lower that cost quite a bit,” Goyal says. “Do I know that my GPUs are not data-starved, for example? How can we make sure that you’re able to move data with the highest throughput possible? That’s more than cloud providers would provide you.”
Related Items:
The Future of Data Science Is About Removing the Shackles on AI
Birds Aren’t Real. And Neither Is MLOps
A ‘Breakout Year’ for ModelOps, Forrester Says