This document provides a high-level view of where Katib will grow in 2019. These objectives are based on Katib's Critical User Journey (CUJ), which can be found here.
The original Katib design document can be found here.
- Stabilize APIs for StudyJobs
- Fully integrate Katib with existing E2E examples:
  - XGBoost
  - MNIST
  - GitHub issue summarization
- Publish API documentation, best practices, tutorials
- Issues list
- Issues for 0.5.0 release
The objectives here are organized around the three stages defined in the CUJ:
- Integration with Kubeflow (KF) distributed training components:
  - TFJob
  - PyTorch
- Allow Katib to support other operator types generically #341
- Streamlining the StudyJob schema by providing simpler ways to write worker specs and metrics collector specs
- Expose more information in StudyJob status fields
- Integration with Jupyter notebooks and Fairing #355
- Allow users to start with an existing model from a notebook and do HP tuning with minimal code changes
- Allowing a StudyJob to be resumed with additional trials #346
- Generating StudyJob configurations and launching StudyJobs through UI
- Supporting additional suggestion algorithms #15
- Support for StudyJob deployment in a different namespace #343
- Enhance metrics collection
  - The design may need to be revisited: use a push model instead of a pull model?
- UI enhancements: allow data scientists to visualize results more easily
- Support for persistent model and metadata storage
  - Ideally, users should be able to export and reuse trained models from common storage
Designs are pending for the following new features:
- Multi-Tenancy Support
- Neural architecture search (NAS)
- Batch scheduling
- Integration with Pipelines
- Early stopping feature
- Improve E2E test coverage
- Improve test harness
- Enhance the release process by adding automation (see https://bit.ly/2F7o4gM)