[WIP] Experiment tracking proposal #195
# Experiment tracking

This document is a design proposal for a new service within Kubeflow: experiment tracking. The need for a tool like this has been expressed in multiple issues and discussions.

## What is experiment tracking

Production machine learning systems can generate a huge number of models. Every training pass can produce at least one model, and potentially several. If training jobs are generated automatically (for example via hyperparameter tuning, or retraining as new data becomes available), this can quickly grow to thousands of models. It's important to be able to navigate them, select those with the best performance, examine them in detail, and set up an inference cluster from them. We need to track things like model location (on S3, GCS, or disk), model metrics (final accuracy, F1 score, or whatever the experiment requires), and logs location.

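To make this concrete, here is a minimal sketch of the kind of record such a service would store; the class and field names are assumptions for illustration, not a settled schema:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ExperimentRecord:
    """Illustrative sketch of one tracked model (not a proposed schema)."""
    model_id: str
    model_location: str                      # e.g. an S3/GCS URI or a local path
    logs_location: Optional[str] = None      # where the training logs live
    metrics: Dict[str, float] = field(default_factory=dict)  # e.g. {"accuracy": 0.93}
    quality: Optional[float] = None          # single per-experiment quality weight


# Hypothetical example entry for one training run.
record = ExperimentRecord(
    model_id="run-0001",
    model_location="gs://my-bucket/models/run-0001",
    logs_location="gs://my-bucket/logs/run-0001",
    metrics={"accuracy": 0.93},
    quality=0.93,
)
```
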
## Scale considerations

We need to support a scale of tens of thousands of models, potentially adding garbage collection above this. At this scale we need to be able to quickly select the best models for a particular problem.

> Review discussion:
>
> * "The straightforward approach would be to use Kubernetes jobs for this. Let Kubernetes handle the orchestration and GC. Each job would be configured with env variables."
> * "We need more than just the number of replicas. It's an important thing to consider when selecting the underlying database."
> * "Oh, I'm thinking an …"
> * "I'm not sure if that's what you mean, but we've discussed using CRDs as experiments and decided against it. The sheer number of experiments involved and the lack of querying is a problem. We still need a database somewhere. As for running the actual experiments, then yeah, they will be TFJobs, so regular pods."

## Model provenance

Another feature commonly asked for is model provenance. It's crucial to be able to reproduce results. For every model we need to record (a record sketch follows at the end of this section):

* Initial state, whether it's random weights or based on preexisting models
* Dataset used for training
* Dataset used for testing
* Feature engineering pipeline used
* Katib study id
* Model architecture (code used)
* Hyperparameters

> Review discussion:
>
> * "If the code to create the model is in a VCS, e.g. git, it should also track the version of the code used to create the model."
> * "Agree, that's what I meant by 'model architecture'. But a good idea would be to make it point to …"

Part of it can be solved by integration with Pachyderm.

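To make the list above concrete, here is a rough sketch of a provenance record that could sit alongside the tracking entry sketched earlier; every field name and value below is an assumption for illustration, not a proposed schema:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ModelProvenance:
    """Illustrative sketch of the provenance fields listed above."""
    initial_state: str                 # "random" or a reference to a preexisting model
    training_dataset: str              # e.g. a dataset URI or a Pachyderm commit
    test_dataset: str
    feature_pipeline: Optional[str]    # reference to the feature engineering pipeline
    katib_study_id: Optional[str]
    code_version: str                  # e.g. a git commit hash of the model architecture
    hyperparameters: Dict[str, float]


# Hypothetical example for one model.
provenance = ModelProvenance(
    initial_state="random",
    training_dataset="pachyderm://repo/train@a1b2c3",
    test_dataset="pachyderm://repo/test@a1b2c3",
    feature_pipeline="pipelines/features_v2.py",
    katib_study_id="study-42",
    code_version="9f8e7d6",
    hyperparameters={"learning_rate": 0.001, "batch_size": 64},
)
```
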
## Model performance

To be able to pick the best model for a problem, we need to record metrics. Metrics can differ from problem to problem, but we can support a single number as a quality weight (the user can define this number per experiment, whether it's accuracy, F1 score, etc.). We need to support very efficient queries using this metric.

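As a minimal sketch of what "very efficient queries" could look like with a SQL-backed store (the table and column names follow the illustrative record above and are assumptions), an index on the quality weight keeps top-N selection cheap even with tens of thousands of rows:

```python
import sqlite3

# In-memory database purely for illustration; the real service would use a proper DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE models (model_id TEXT PRIMARY KEY, experiment TEXT, quality REAL)"
)
# Composite index so "best N models for experiment X" avoids a full scan.
conn.execute("CREATE INDEX idx_models_quality ON models (experiment, quality)")

conn.executemany(
    "INSERT INTO models VALUES (?, ?, ?)",
    [(f"run-{i:05d}", "mnist", i / 50000.0) for i in range(50000)],
)

# Top 10 models for one experiment, ranked by the per-experiment quality weight.
best = conn.execute(
    "SELECT model_id, quality FROM models "
    "WHERE experiment = ? ORDER BY quality DESC LIMIT 10",
    ("mnist",),
).fetchall()
print(best)
```
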
## Model introspection

For selected models we should be able to set up model introspection tools like Tensorboard. Tensorboard provides good utility, allows comparison of a few models, and it was recently announced that it will integrate with PyTorch. I think it's reasonable to use Tensorboard for this problem and allow easy spawning of a Tensorboard instance for selected models. We might need to find an alternative for scikit-learn.

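As a sketch of what "easy spawning" could mean from the tracking service's point of view (the logs_location value is a hypothetical entry from a tracking record, and in-cluster this would more likely be a Deployment than a local process):

```python
import subprocess

# Hypothetical: logs_location would come from the tracking record of a selected model.
logs_location = "gs://my-bucket/logs/run-0001"

# Launch a TensorBoard instance pointed at the selected model's logs.
proc = subprocess.Popen(
    ["tensorboard", "--logdir", logs_location, "--port", "6006"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
print(f"TensorBoard started (pid {proc.pid}) on port 6006")
```
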
## Inference cluster setup

For the best model, we should be able to easily spawn an inference cluster. We should support TF Serving and Seldon.

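For illustration only, here is a sketch of what spawning a TF Serving deployment for one tracked model could look like with the Kubernetes Python client; the image, namespace, model name, and model path are assumptions, and Seldon would use its own resource types:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

MODEL_NAME = "run-0001"                        # hypothetical tracked model
MODEL_PATH = "gs://my-bucket/models/run-0001"  # hypothetical model_location

labels = {"app": f"serve-{MODEL_NAME}"}
container = client.V1Container(
    name="tf-serving",
    image="tensorflow/serving",
    args=[f"--model_name={MODEL_NAME}", f"--model_base_path={MODEL_PATH}"],
    ports=[client.V1ContainerPort(container_port=8500)],  # gRPC port
)
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name=f"serve-{MODEL_NAME}", labels=labels),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)
# Assumes a "kubeflow" namespace exists; adjust for the actual deployment target.
client.AppsV1Api().create_namespaced_deployment(namespace="kubeflow", body=deployment)
```
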
## Integration with TFJob and Katib

It should be very easy, even automatic, to make an entry in experiment tracking from a TFJob. The TF operator should be tightly integrated with it, and Katib should be able to both read models and write new ones.

Katib's workflow could look like this:

Get the study's hyperparameter space -> select all existing models for the study_id -> find out which hyperparameter combinations are missing -> create the relevant training jobs and add records to experiment tracking.

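A rough sketch of that loop in Python; every function here is a hypothetical placeholder standing in for Katib and tracking-service APIs that don't exist under these names:

```python
from typing import Dict, List


def get_study_space(study_id: str) -> List[Dict[str, float]]:
    """Placeholder: all hyperparameter combinations defined for the study."""
    return [{"lr": 0.01}, {"lr": 0.001}, {"lr": 0.0001}]


def list_tracked_combinations(study_id: str) -> List[Dict[str, float]]:
    """Placeholder: hyperparameters of models already recorded for the study."""
    return [{"lr": 0.01}]


def launch_training_job(study_id: str, params: Dict[str, float]) -> str:
    """Placeholder: would create a TFJob and return its identifier."""
    return f"{study_id}-job-{hash(tuple(sorted(params.items()))) & 0xffff:04x}"


def add_tracking_record(study_id: str, job_id: str, params: Dict[str, float]) -> None:
    """Placeholder: would write a new entry to experiment tracking."""
    print(f"tracked {job_id} for study {study_id} with {params}")


def fill_missing_trials(study_id: str) -> None:
    """Create training jobs for hyperparameter combinations not yet tracked."""
    done = {tuple(sorted(p.items())) for p in list_tracked_combinations(study_id)}
    for params in get_study_space(study_id):
        if tuple(sorted(params.items())) in done:
            continue  # this combination already has a tracked model
        job_id = launch_training_job(study_id, params)
        add_tracking_record(study_id, job_id, params)


fill_missing_trials("study-42")
```
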
## UI

It's very important to provide a good UI that allows easy model navigation and exposes most of the features listed below via a button click:

* Examine in Tensorboard
* Train with new data
* Spawn inference cluster
* Run tests
* Show model provenance pipeline

## Alternatives

* Tensorboard - Wasn't meant for a large number of models. It's better for very detailed examination of a smaller number of models. Uses tf.Event files.
* MLflow - One of the big cons is using files as storage for models, which would require something like Dask or Spark to query them efficiently. It can store files in multiple backends (S3 and GCS among others).
* ModelDB - Requires MongoDB, which is problematic.
* StudioML - Also uses a filesystem/object store as backend, which has the same querying considerations as MLflow.

> Review discussion on MLflow:
>
> * "Does it really need Dask or Spark? It has a REST API."
> * "Well, if you try to query 50000 records from one file (and by query I mean 'highest value of X'), it's going to require something more..."
> * "Although MLflow does store files on disk. It would save some time if folks looked at forking it and then integrating a database to store the tracking information."
>
> Review discussion on ModelDB:
>
> * "Not only Mongo but SQLite. And it resets the SQLite at the beginning of a process."
> * "Right, so it's no good for persistent experiment tracking, which is what we're after."
> * "How about first designing a better ModelDB equivalent and then using that for tracking experiments? I would recommend we keep each of these very independent for now, so that Kubeflow components/apps can integrate with a wide variety of tools, e.g. TFX, Katib, AutoML."
> * "The reason I think our current model is flawed is that we have two sources of truth. Katib uses SQL, ModelDB uses MongoDB or SQLite. Every time you want, ModelDB will sync stuff from Katib's DB. That means if you do a sync with tens of thousands of models, it's going to lock the whole system. I think we should build a single source of truth for where models are and how they performed, and Katib should use it. This would negate the need for Katib's database altogether and, therefore, make it much easier to handle. In another issue we've discussed Katib as a model management tool, but we've decided that Katib's scope is hyperparameter tuning, and model management is something different (however required)."
> * "Hi team, I'm the PM on the MLflow team at Databricks. Some of the engineers will chime in here too. Adding a database-backed tracking store to the tracking server is on our roadmap, and there is already a pluggable API!"

All three have some subset of these features, but none of them seems to be designed for the scale we're aiming for, and they don't have integration with Kubeflow either.

## Authors

* inc0

> Review comment:
>
> * "I think we should focus on experiment tracking. This is different from monitoring your production models, like gathering metrics about model drift or accuracy in the production env."