Road Map for Docker Containers
This is the same roadmap document that I'm using internally, with the internal bits taken out.
I am forcing these containers to get continuous support by using them for TF's internal CI: if they don't work, then our tests don't work. While I'm getting that ready during Q4 and Q1, I'm explicitly avoiding features that the TF team is not going to use; those would be dead on arrival unless we set up more testing for them, which I don't have the cycles to consider yet.
TF Nightly Milestone - Q4 Q1
Goal: Replicable container build of our tf-nightly Ubuntu packages
Containers can build tf-nightly package
SIG Build repository explains how to build tf-nightly package in Containers
Documentation exists on how to make changes to the containers
Suite of Kokoro jobs exists that publishes 80%-identical-to-now tf-nightly via containers
TF-nightly is officially built with the new containers
Documentation exists on how to use and debug containers
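The intended flow for the milestone above might look roughly like this; the image name, tag, and Bazel target here are assumptions for illustration, not the official ones:

```shell
# Sketch only: the image name, tag, and build target are assumptions.
# Pull the SIG Build image and build the tf-nightly wheel inside it,
# mounting a TensorFlow checkout from the host.
docker pull tensorflow/build:latest-python3.9
docker run -it -v "$PWD":/tf/tensorflow -w /tf/tensorflow \
    tensorflow/build:latest-python3.9 \
    bazel build //tensorflow/tools/pip_package:build_pip_package
```

The same container would then serve both interactive debugging (drop the build command and run bash) and the Kokoro jobs, which is what makes the builds replicable.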
Release Test Milestone - Q4 Q1
Goal: Replicable container builds of our release tests, supporting each release
Containers can run same-as-now Nightly release tests
SIG Build repository explains how to run release tests as we do
Suite of CI jobs exists that matches current rel/nightly jobs
Existing release jobs replaced (but reversible if needed) by Container-based equivalent
Containers may be maintained and updated separately for TF release branches
Containers used for nightly/release libtensorflow and ci_sanity (now "code_check") jobs
CI & RBE Milestone - Q4 Q1/Q2
Goal: The main tests and our RBE tests use the same Docker container, updated in one place
Containers support internal presubmit/continuous build behavior
Containers are used in internal buildcop-monitored, DevInfra-owned presubmit/continuous jobs
Containers can be used in RBE
Containers are used as RBE environment for internal buildcop-monitored, DevInfra-owned jobs
DevInfra's GitHub-side presubmit tests use the containers
Containers are published on gcr.io
There is an easy way to verify that a change to the containers will not break the whole internal test suite
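One lightweight way to gate container changes would be a presubmit smoke test that builds the candidate image and checks the toolchain before the full internal suite runs against it. A hypothetical GitHub Actions sketch, where the file paths, image name, and job layout are all assumptions:

```yaml
# Hypothetical presubmit sketch; paths, image names, and jobs are assumptions.
name: container-smoke-test
on:
  pull_request:
    paths:
      - "dockerfiles/**"
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build candidate image
        run: docker build -t tf-sigbuild-candidate -f dockerfiles/devel.Dockerfile .
      - name: Check the build toolchain is intact
        run: docker run tf-sigbuild-candidate bazel version
```

A fast check like this would not replace the internal suite, but it would catch obviously broken images before they reach it.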
Forward Planning Milestone - Q2
Goal: Establish clear plan for any future work related to these containers. This is internal team planning stuff so I've removed it.
Downstream & OSS Milestone - Q2/Q3
Goal: Downstream users and custom-op developers use the same containers as our CI
SIG Addons / SIG IO use these Containers (or derivative) instead of old custom-op ones
Custom-op documentation migrated to SIG Build repository
Resolve: what to do about inconvenient default packages for e.g. SIG Addons (keras-nightly, etc.)
Resolve: what to do about inconveniently large image sizes for e.g. GPU content not needed
Docker-related documentation on tensorflow.org replaced with these containers
"devel" containers deprecated in favor of SIG Build containers
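For the downstream SIGs, a derivative image could start from the SIG Build container and adjust the inconvenient defaults noted above. A hypothetical sketch, where the base image tag and the packages touched are assumptions:

```dockerfile
# Hypothetical derivative image for a downstream SIG such as SIG Addons.
# The base tag and the packages removed/installed here are assumptions.
FROM tensorflow/build:latest-python3.9

# Swap out a default package that conflicts with the SIG's own pins,
# e.g. the pre-installed keras-nightly.
RUN pip uninstall -y keras-nightly && \
    pip install tensorflow-addons
```

Deriving rather than forking keeps the SIG images on the same toolchain as the CI containers, which is the point of this milestone.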
Thanks for sharing the roadmap.
Steps that mention "internal/our" requirements can be a bit hard to follow from outside, but I think that is expected.
Looking at the new GitHub Actions we have here in the repository, it is very clear what we are doing and when on the OSS side, within the limits of what we have orchestrated with GitHub Actions.
When we mix OSS recipes/code with internal, non-visible steps (e.g. orchestration, args like commits for nightly, etc.), the machinery can be a little hard to follow unless the non-visible part is compensated for by some documentation details (e.g. what event/cron starts the scripts, what the script chain is, what the args are, etc.).
But such compensating documentation is generally at constant risk of becoming outdated, since internal teams have direct visibility into internal changes, so their operations are not directly impacted by stale public documentation.
Since GitHub Actions rely on a well-known and popular YAML dialect, and GitHub users/contributors/developers are generally skilled in it, do you think it would be possible to set up TF-owned self-hosted GitHub Actions runners on Google Cloud? That would give us a complete overview of the TF OSS build and orchestration, and probably also a bit of autonomy for the SIG, without adding too much overhead to the system.