Why we do machine learning engineering with YAML, not notebooks

Notebooks are great for designing models, not deploying them.

Caleb Kaiser

Most data scientists spend the majority of their working hours in a notebook. As a result, most production machine learning platforms prioritize notebook support. If you try out a new production ML platform, chances are its onboarding tutorial will begin with a .ipynb file.

When we built Cortex, our production machine learning platform, we spent a lot of time considering the correct interface for defining production ML pipelines. Ultimately, we decided not to support notebooks, opting instead for YAML config files.

Notebooks are a form of literate programming, and in all literate programming tools the emphasis is on presentation, which is a big reason why notebooks are so useful.

For many data scientists, the finished product of a work session is a business analysis. They need to show team members—who oftentimes aren’t technical—how their data became a specific recommendation or insight.

A notebook, where paragraphs of formatted text can lie between cells of code and where charts can be displayed directly beneath the code that generates them, is an ideal format for this presentation.

Even better, notebooks are interactive. Want to see what the chart looks like with a second dataset? Just add a new cell. Want to test a different model? Tweak one line of code and rerun the cell.

However, the same qualities that make notebooks great for exploring and explaining data make them a poor fit for production.

The priorities in building a production machine learning pipeline—the series of steps that take you from raw data to product—are not fundamentally different from those of general software engineering. Specifically, they are:

1. Your pipeline should be reproducible

Instead of trying to streamline a notebook’s various imports and function calls into a more easily reproducible script, why not use something simple and declarative like YAML?

For example, a cortex.yaml file can define the deployment stage of a pipeline, naming the code to run and the configuration to pass to it.

The code to be executed, predictor.py, is clear, as are its configuration variables. It’s simple, readable, and will produce predictable results.
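To make that concrete, here is a minimal Python sketch of what a declarative deployment amounts to once the config file has been parsed. It is an illustration under stated assumptions, not Cortex's actual schema or API: the `DEPLOYMENT` dictionary stands in for a parsed YAML document, and its keys, the `Predictor` class, and the `deploy` helper are all hypothetical names.

```python
# Hypothetical stand-in for a parsed YAML deployment file.
# The keys are illustrative only, not Cortex's real schema.
DEPLOYMENT = {
    "name": "eta-predictor",
    "predictor": {
        "path": "predictor.py",
        "config": {"speed_kmh": 40.0, "buffer_minutes": 5.0},
    },
}


class Predictor:
    """Hypothetical contents of predictor.py: every setting arrives via config."""

    def __init__(self, config):
        self.speed_kmh = config["speed_kmh"]
        self.buffer_minutes = config["buffer_minutes"]

    def predict(self, payload):
        # Travel time in minutes at a constant speed, plus a fixed buffer.
        travel_minutes = payload["distance_km"] / self.speed_kmh * 60
        return round(travel_minutes + self.buffer_minutes, 1)


def deploy(spec):
    """Build the predictor from the declarative spec alone, with no hidden state."""
    return Predictor(spec["predictor"]["config"])


predictor = deploy(DEPLOYMENT)
print(predictor.predict({"distance_km": 10}))  # prints 20.0
```

Because everything the predictor needs is spelled out in the spec, building it twice from the same spec gives the same behavior, which is the property we want from a production pipeline.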

Now, there are some projects focused on parameterizing notebooks so that they can be treated as pure functions, but it’s always felt like an unnecessary “square peg in a round hole” effort to me.

2. Collaborating on your pipeline should be easy

Git works by tracking the plaintext differences between file versions. With code, this results in a very readable experience, where you can easily visualize what is changing and how it impacts the software:

[Image: a Git diff of code. Source: Cortex repo]

Notebook files, however, are essentially giant JSON documents that embed base64-encoded images and other binary output. For a complex notebook, it would be extremely hard for anyone to read through a plaintext diff and draw meaningful conclusions, because much of it would just be rearranged JSON and unintelligible blocks of base64.
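A small sketch shows the effect. The document built below is a simplified, hypothetical stand-in modeled on the .ipynb layout (real notebooks carry more fields), and the "image" is just fake bytes. The code cell is identical in both versions; only the rendered output changes, as it would after a rerun.

```python
import base64
import difflib
import json


def make_notebook(png_bytes):
    """Build a minimal notebook-like document with one code cell and one image output."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "source": ["plot(df)"],
                "outputs": [
                    {
                        "output_type": "display_data",
                        "data": {
                            "image/png": base64.b64encode(png_bytes).decode("ascii")
                        },
                        "metadata": {},
                    }
                ],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def diff_lines(before, after):
    """Return only the added and removed lines of a plaintext diff."""
    a = json.dumps(before, indent=1).splitlines()
    b = json.dumps(after, indent=1).splitlines()
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="")
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Same code, rerun: only the chart bytes differ (fake bytes stand in for a PNG).
old = make_notebook(b"\x89PNG" + bytes(range(64)))
new = make_notebook(b"\x89PNG" + bytes(range(1, 65)))

changed = diff_lines(old, new)
for line in changed:
    print(line[:60])  # truncate the base64 payload for display
```

The diff reports a change even though no code was touched, and the changed lines are base64 text that a reviewer cannot interpret.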

When you combine this with the fragility of complicated notebooks, where cells often need to be run in an arbitrary but precise order to generate the right result, collaboration becomes tricky.
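The order problem is easy to reproduce outside a notebook. In this sketch, three hypothetical "cells" share one namespace, the way notebook cells share a kernel; the cell names and the toy ETA data are assumptions made up for illustration.

```python
# Each "cell" is a snippet of code that reads and writes shared state.
CELLS = {
    "load": "rows = [12.0, 15.0, 90.0, 14.0]",
    "drop_outliers": "rows = [r for r in rows if r < 60]",
    "fit": "mean_eta = sum(rows) / len(rows)",
}


def run(order):
    """Execute the named cells in the given order against one shared namespace."""
    namespace = {}
    for name in order:
        exec(CELLS[name], namespace)
    return namespace["mean_eta"]


intended = run(["load", "drop_outliers", "fit"])
reordered = run(["load", "fit", "drop_outliers"])

print(intended)
print(reordered)  # prints 32.75: the outlier was still present when "fit" ran
```

Both runs finish without an error, and the saved file would contain the same code either way. Only the result differs, which is what makes this kind of breakage hard to spot.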

For example, imagine you had an ETA prediction feature, and your pipeline relied on a complicated notebook to export a trained model. No one would be able to work on the notebook, as any small tweak might lead to invisible but cascading changes, such that your model performs poorly.

Trying to reverse engineer what changes caused the performance drop would be hopeless, both because of the unreadable nature of notebook diffs and because of the explainability problems mentioned earlier. Your pipeline would, in essence, have a “don’t touch it or it will break” sign on it.

With YAML, however, this problem is solved. There is no hidden state or arbitrary execution order in a YAML file, and any changes you make to it can easily be tracked by Git:

[Image: a Git diff of a YAML file. Source: Cortex repo]

If one of those changes breaks your model, it’s both reversible and investigable.
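For contrast with a notebook diff, here is a sketch of the same kind of plaintext comparison applied to a config file, using Python's difflib in place of Git. The file name `deploy.yaml` and the keys inside it are hypothetical.

```python
import difflib

# Hypothetical config contents; the keys are illustrative only.
OLD = """\
name: eta-predictor
predictor:
  path: predictor.py
  config:
    model_version: 3
"""

NEW = OLD.replace("model_version: 3", "model_version: 4")

diff = list(
    difflib.unified_diff(
        OLD.splitlines(),
        NEW.splitlines(),
        fromfile="a/deploy.yaml",
        tofile="b/deploy.yaml",
        lineterm="",
    )
)
print("\n".join(diff))

# Keep only the lines that actually changed, skipping the file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
```

The whole change is one removed line and one added line, so a reviewer can see at a glance which setting moved and revert it if the model regresses.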

As with the last example, there are some projects dedicated to making diffing and merging notebooks easier, but it seems like a lot of effort to emulate YAML’s default nature.

3. All code in your pipeline should be testable

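In a typical software engineering workflow, changes pass through several checks before they ship:

  • Engineers write tests before pushing any code.
  • Pull requests are automatically checked by CI/CD tooling.
  • A final manual review is given by another engineer.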

As a result, anytime the codebase is changed, it is done with the highest possible level of confidence that it will not break things.

With notebooks, this is difficult.

Python unit testing libraries, like unittest, can be used within a notebook, but standard CI/CD tooling has trouble dealing with notebooks for the same reasons that notebook diffs are hard to read.

As a result, it’s hard to ship a new notebook to production with a high level of confidence that it won’t break anything—and if something does break, good luck figuring out why.

Applying CI/CD to YAML files and the code they reference, on the other hand, is straightforward. DevOps teams have been doing it for years.
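As a rough illustration of why this is straightforward, code that lives in an ordinary Python file can be exercised with the standard unittest library and run by any CI job. The `Predictor` class below is a hypothetical stand-in for the code a config file might reference, not Cortex's actual interface.

```python
import unittest


class Predictor:
    """Hypothetical predictor code of the kind a config file would point to."""

    def __init__(self, config):
        self.speed_kmh = config["speed_kmh"]

    def predict(self, payload):
        # Travel time in minutes at a constant speed.
        return payload["distance_km"] / self.speed_kmh * 60


class PredictorTest(unittest.TestCase):
    def test_known_distance(self):
        predictor = Predictor({"speed_kmh": 30.0})
        self.assertAlmostEqual(predictor.predict({"distance_km": 15}), 30.0)

    def test_missing_config_fails_loudly(self):
        # A config without the required key should fail at construction time.
        with self.assertRaises(KeyError):
            Predictor({})


def run_tests():
    """Run the suite programmatically and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PredictorTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


passed = run_tests()
print("tests passed:", passed)
```

A CI job only needs to run this file and check the result; there is no kernel state to set up and no cell order to get right.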

To support production pipelines in Cortex, we needed to build an interface that allowed users to specify which code should be executed at what time, with which configuration.

YAML and notebooks are both tools for that purpose, in a sense. A notebook, at a very basic level, is just a bunch of JSON that references blocks of code and the order in which they should be executed.

But notebooks prioritize presentation and interactivity at the expense of reproducibility. YAML is the other side of that coin, ignoring presentation in favor of simplicity and reproducibility—making it much better for production.
