[WIP] Experiment tracking proposal #195

Closed · wants to merge 7 commits

proposals/experiment-tracking-proposal.md (84 additions, 0 deletions)

# Experiment tracking

This document is a design proposal for a new service within Kubeflow: experiment tracking. The need for a tool like this has been
expressed in multiple issues and discussions.

## What is experiment tracking

> I think we should focus on experiment tracking. This is different from monitoring your production models, like gathering metrics about model drift or accuracy in the production env.


Production machine learning systems may generate a huge number of models. We want the ability to compare multiple training runs and find the model that provides the best metrics. When trying different combinations of parameters to find the optimal ones, we may produce multiple models. The number of training jobs generated automatically (for example via hyperparameter tuning, or retraining as new data becomes available) can quickly grow to thousands of models. It's important to be able to navigate this, select the models with the best performance, examine them in detail and compare them. Once we select the best model we can move on to the next step, which is inference. We need to track things like model location (on S3, GCS or disk), model metrics (final accuracy, P1 score, whatever the experiment requires) and logs location. Our data scientists may require isolation, as they are working on different experiments, so they should have a view/UI that lets them find their experiments quickly.
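
To make the shape of this data concrete, a hypothetical per-run record could look like the sketch below. This is an illustration only; the field names are assumptions, not part of the proposal.

```python
# Hypothetical sketch of a single tracked run/model record; field names are
# illustrative, not a committed schema.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TrackedRun:
    run_id: str
    owner: str                 # supports per-user/team isolation in the UI
    model_location: str        # e.g. "s3://bucket/models/run-42", "gs://..." or a disk path
    logs_location: str
    metrics: Dict[str, float] = field(default_factory=dict)  # e.g. {"accuracy": 0.97, "p1": 0.84}


run = TrackedRun(
    run_id="run-42",
    owner="alice",
    model_location="s3://bucket/models/run-42",
    logs_location="s3://bucket/logs/run-42",
    metrics={"accuracy": 0.97},
)
print(run)
```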

## Example user stories
* I'm a data scientist working on a problem. I'm looking for an easy way to compare multiple training jobs with multiple sets of hyperparameters. I would like to be able to select the top 5 jobs measured by P1 score and examine which model architecture, hyperparameters, dataset and initial state contributed to this score. I would want to compare these 5 together in a highly detailed way (for example via Tensorboard). I would like a rich UI to navigate models without the need to interact with infrastructure.
* I'm part of a big ML team in a company. Our whole team works on a single problem (for example search) and every person builds their own models. I'd like to be able to compare my models with others'. I want to be confident that nobody will accidentally delete a model I'm working on.
* I'm a cloud operator on an ML team. I would like to take the current production model (architecture + hyperparameters + training state) and retrain it with new data as it becomes available. I would want to run a suite of tests and determine whether the new model performs better. If it does, I'd like to spawn a tf-serving (or Seldon) cluster and perform a rolling upgrade to the new model.
* I'm part of a highly sophisticated ML team. I'd like to automate retraining -> testing -> rollout for models so they can be upgraded nightly without supervision.

> This e.g. is not part of experiment tracking imho. It's about model management and model monitoring.
> Is there a good/common term to describe this operational bit of models?
> Model management, model operations?

> **Author:** Model management is an alternative term for experiment tracking, I think; at least I've understood it as such. As for functionality, because we'll make it k8s-native, the cost of adding this feature will be so low that I think we should do it just for the users' benefit. Ongoing monitoring of models isn't in scope, but as long as the monitoring agent saves observed metrics (say, average accuracy over the last X days) back to this service, you can still benefit from it.

> **durandom** (Oct 12, 2018): I think the notion of "model management" and "experiment tracking" is slightly different: "management" has a production connotation and "experiment" has a devel connotation. Did @jlewi in this comment thread get to a common definition? This mlflow issue also has a discussion around the use cases of the various tools, and a Google search for "experiment tracking" ai ml vs "model management" ai ml gives 500 vs 75k results.
> Please don't get me wrong, I'm all for having a solution for this, because I too think this is a missing component of Kubeflow.
> I'd just limit the scope to the devel side of the house and let Pachyderm and Seldon focus on the production side.

> **Author:** One clarification: when I say, for example, model rollout, what I mean is a single call to k8s to spawn a Seldon cluster. Actual serving, monitoring, etc. is beyond scope, I agree, but I think it'd be a nice touch to allow a one-click mechanism. For Pachyderm integration see below; I actually wanted to keep the pipeline uuid in the database. If someone uses Pachyderm, we'll integrate with it and allow quick navigation, for example a one-click link to the relevant Pachyderm UI.

* I'm part of the ML team in a company. I would like to be able to track the parameters used to train an ML job and track the metrics produced. I would like to have isolation so I can find the experiments I worked on. I would like the ability to compare multiple training runs side by side and pick the best model. Once I select the best model I would like to deploy it to production. I may also want to test my model with historical data to see how well it performs, and maybe roll out my experiment to a subset of users before fully rolling it out to all users.


## Scale considerations

> The straightforward approach would be to use Kubernetes Jobs for this. Let Kubernetes handle the orchestration and GC. Each job would be configured with env variables.

> **Author:** We need more than just a number of replicas. It's an important thing to consider when selecting the underlying database.

> Oh, I'm thinking an experiment would be a Kubernetes primitive, like a Job, with no replicas involved. The job would be scheduled by Kubernetes. So if you run 100 or 1000 experiments, you just create them and let Kubernetes handle the scaling, i.e. the scheduling.

> **Author:** I'm not sure if that's what you mean, but we've discussed using CRDs as experiments and decided against it. The sheer number of experiments involved and the lack of querying is a problem; we still need a database somewhere. As for running the actual experiments, yes, those will be TFJobs, so regular pods.


We need to support a scale of tens of thousands of models, potentially adding garbage collection above this. At this scale we need to be able to quickly select the best models
for a particular problem.

## Model provenance

Another feature commonly asked for is model provenance. It's crucial to be able to reproduce results. For every model we need to record:

* Initial state, whether it's random weights or based on preexisting models
* Dataset used for training
* Dataset used for testing
* Feature engineering pipeline used
* Katib study id
* Model architecture (code used)
* Hyperparameters

> If the code to create the model is in a VCS, e.g. git, it should also track the version of the code used to create the model.

> **Author:** Agree, that's what I meant by "model architecture". But a good idea would be to make it point to:
> * code (including commit id)
> * docker image


Part of it can be solved by integration with Pachyderm.
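
As an illustration only (not a committed schema), the provenance fields listed above could be captured in a single record along these lines; the names and types are assumptions.

```python
# Hypothetical provenance record covering the fields listed in this section.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ModelProvenance:
    initial_state: str                   # "random" or a reference to a preexisting model
    training_dataset: str                # URI/version of the dataset used for training
    test_dataset: str                    # URI/version of the dataset used for testing
    feature_pipeline: Optional[str]      # e.g. a Pachyderm pipeline uuid
    katib_study_id: Optional[str]
    code_commit: str                     # git commit id of the model architecture code
    docker_image: str                    # image used to run the training job
    hyperparameters: Dict[str, float] = field(default_factory=dict)
```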

## Model performance

To be able to pick the best model for a problem, we need to record metrics. Metrics can differ from problem to problem, but we can support a single number as a quality weight
(the user can define this number per experiment, whether it's accuracy, P1 score, etc.). We need to support very efficient queries using this metric.
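
As a rough illustration of what "efficient queries using this metric" could mean, the sketch below uses an in-memory SQLite database with an index on the per-problem quality metric. The schema is hypothetical, and the actual choice of database is left open by this proposal.

```python
# Minimal sketch: one user-defined quality number per model, indexed so that
# "top N models for a problem" stays cheap at tens of thousands of rows.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE models (
        id TEXT PRIMARY KEY,
        problem TEXT,          -- e.g. "search-ranking"
        quality REAL,          -- user-defined metric: accuracy, P1 score, ...
        location TEXT          -- S3/GCS/disk path to the model artifact
    )
""")
conn.execute("CREATE INDEX idx_problem_quality ON models (problem, quality)")

conn.executemany(
    "INSERT INTO models VALUES (?, ?, ?, ?)",
    [(f"run-{i}", "search-ranking", i / 10000.0, f"s3://bucket/run-{i}") for i in range(10000)],
)

top5 = conn.execute(
    "SELECT id, quality, location FROM models "
    "WHERE problem = ? ORDER BY quality DESC LIMIT 5",
    ("search-ranking",),
).fetchall()
print(top5)
```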

## Model introspection

For selected models we should be able to set up model introspection tools, like Tensorboard.
Tensorboard provides good utility, allows comparison of a few models, and it was recently announced that it will integrate with PyTorch. I think it's reasonable to use Tensorboard
for this problem and allow easy spawning of a Tensorboard instance for selected models. We might need to find an alternative for scikit-learn; perhaps we can try MLflow for scikit-learn.

> There is also http://tensorboardx.readthedocs.io which can create a Tensorboard from any Python code.
> We've started working with mlflow because it has a nice web UI and is easy to use with its Python framework.
> I don't know if Tensorboard with tensorboardX has some benefits, though.

> **Author:** I think it does! You could use it to add Tensorboard to scikit-learn (just log accuracy every batch of training).

> MLflow, Tensorboard and tensorboardX are complementary. I have worked with some MLflow users who use them together. Check out this example in the MLflow repository of using MLflow with PyTorch and tensorboardX: https://github.com/mlflow/mlflow/blob/master/examples/pytorch/mnist_tensorboard_artifact.py
>
> From the doc at the top of that code example:
>
> *Trains an MNIST digit recognizer using PyTorch, and uses tensorboardX to log training metrics and weights in TensorBoard event format to the MLflow run's artifact directory. This stores the TensorBoard events in MLflow for later access using the TensorBoard command line tool.*


## Inference cluster setup

For the best model, we should be able to easily spawn an inference cluster. We should support tf-serving and Seldon.
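
As a sketch of what "easily spawn an inference cluster" could amount to for tf-serving, the snippet below uses the official Kubernetes Python client to create a Deployment running the tensorflow/serving image. The names, image tag and model path are placeholders, and a real implementation would more likely go through the tf-serving or Seldon components shipped with Kubeflow.

```python
# Hypothetical sketch: create a TensorFlow Serving Deployment for a tracked
# model via the Kubernetes Python client. Not this proposal's actual mechanism.
from kubernetes import client, config


def spawn_tf_serving(model_name: str, model_base_path: str, namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

    container = client.V1Container(
        name="tf-serving",
        image="tensorflow/serving:latest",
        args=[f"--model_name={model_name}", f"--model_base_path={model_base_path}"],
        ports=[client.V1ContainerPort(container_port=8501)],  # REST port
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": f"serve-{model_name}"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=f"serve-{model_name}"),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels={"app": f"serve-{model_name}"}),
            template=template,
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(namespace=namespace, body=deployment)


# Example with a placeholder model location:
# spawn_tf_serving("best-model", "s3://bucket/models/best-model")
```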

## Integration with TFJob and Katib

It should be very easy, even automatic, to make an entry in experiment tracking from a TFJob. The TF operator should be tightly integrated with it,
and Katib should be able to both read models and write new ones.

Katib workflow could look like this:

Get the study hyperparameter space -> select all existing models for the study_id -> find out which hyperparameter combinations are missing -> create the relevant training jobs and add records to experiment tracking.
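
A rough, self-contained sketch of that loop is below. The experiment-tracking lookup and the TFJob submission are stubbed with placeholder functions; only the "find missing combinations" step is real code.

```python
# Sketch of the Katib-style workflow described above. All service calls are
# placeholders; real code would talk to Katib and the experiment-tracking API.
import itertools


def get_study_space(study_id):
    # Placeholder for "get study hyperparameter space".
    return {"batch_size": [32, 64], "learning_rate": [0.1, 0.01, 0.001]}


def get_tracked_combinations(study_id):
    # Placeholder for "select all existing models for study_id" from the
    # experiment-tracking service; tuples follow sorted parameter-name order.
    return {(32, 0.1), (64, 0.01)}


def submit_tfjob_and_record(study_id, params):
    # Placeholder for "create the relevant training job and add a record".
    print(f"submit TFJob for {study_id}: {params}")


def run_study(study_id):
    space = get_study_space(study_id)
    names = sorted(space)
    tried = get_tracked_combinations(study_id)
    for combo in itertools.product(*(space[name] for name in names)):
        if combo not in tried:
            submit_tfjob_and_record(study_id, dict(zip(names, combo)))


run_study("study-42")
```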

## UI

It's very important to provide a good UI that allows easy model navigation and exposes most of the features listed via a button click:

* Examine in tensorboard
* Train with new data
* Spawn inference cluster
* Run tests
* Show model provenance pipeline

## Alternatives

* Tensorboard - Wasn't meant for a large number of models; it's better for very detailed examination of a smaller number of models. Uses tf.Event files.
* MLFlow - One of the cons is that experiment metadata is stored on disk, which in Kubernetes may require persistent volumes; a better approach would be to store metadata in a database. Pros: it can store models in multiple backends (S3, Azure Cloud Storage and GCS, among other things).
> **durandom** (Oct 11, 2018): I would not care if mlflow stores it in a DB or in a file. If you use mlflow we should use their REST API as the interface and let them handle persistence. And for a DB you'd also need a PV, so 🤷‍♂️
> We have started a repo to make mlflow run on OpenShift: https://github.com/AICoE/experiment-tracking

> **Author:** A managed database (like Bigtable) doesn't (or at least you don't care about it). The biggest issue with the MLflow API (which is directly tied to file storage) is the lack of querying. Currently (unless I'm mistaken) there is no good way in MLflow to ask for the model with the highest accuracy. It could be implemented, but then comparisons would be done in Python, so not super scalable.

> > Managed database (like Bigtable)
>
> Wouldn't this introduce a dependency that Kubeflow wants to avoid?
>
> > Biggest issue for MLflow API (which is directly tied to file storage) is lack of querying
>
> Actually there is a REST API for that. But I haven't used it and I'm not sure how well it scales.

> **Author:** #188 - I've noted it there briefly, with some consequences and how to make it manageable for operators (imho).
> As for the search, it's really not much. I still can't see an option for "get me the best model" without using Spark/Dask.

* ModelDB - Requires MongoDB, which is problematic

> Not only Mongo but also SQLite. And it resets the SQLite database at the beginning of a process.
> We can't persist data without modification.

> **Author:** Right, so it's no good for persistent experiment tracking, which is what we're after.

> **Member:** How about first designing a better ModelDB equivalent and then using that for tracking experiments? I would recommend we keep each of these very independent for now, so that Kubeflow components/apps can integrate with a wide variety of tools, e.g. TFX, Katib, AutoML.

> **Author:** The reason I think our current model is flawed is that we have two sources of truth: Katib uses SQL, ModelDB uses MongoDB or SQLite, and ModelDB has to sync from Katib's DB whenever you want the data. That means a sync of tens of thousands of models is going to lock the whole system. I think we should build a single source of truth for where models are and how they performed, and Katib should use it. This would negate the need for Katib's database altogether and therefore make it much easier to handle. In another issue we discussed Katib as a model management tool, but we decided that Katib's scope is hyperparameter tuning, and model management is something different (however required).

> Hi team, I'm the PM on the MLflow team at Databricks. Some of the engineers will chime in here too. Adding a database-backed tracking store to the tracking server is on our roadmap, and there is already a pluggable API!

* StudioML - Also uses an FS/object store as the backend, which has the same querying considerations as MLflow

All three have some subset of the features, but none of them seems designed for the scale we're aiming for, and they don't have integration with Kubeflow either.

## Authors

* inc0
* zmhassan - Zak Hassan