Set all custom charts image.pullPolicy to IfNotPresent #258

Merged
merged 2 commits into stable from willgraf/chart-update on Mar 5, 2020

Conversation

willgraf
Contributor

No need to pull an image for a container if it already exists on the node. This should reduce the deployment time of new pods.
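
For illustration, a minimal sketch of what this change looks like in one of the custom charts' values.yaml (the chart and image names below are hypothetical, not taken from this repository):

```yaml
# values.yaml of a hypothetical custom chart
image:
  repository: example/tf-serving   # hypothetical image repository
  tag: "0.1.0"                     # hypothetical tag
  # IfNotPresent: the kubelet pulls the image only if it is not already
  # cached on the node, instead of pulling on every container start.
  pullPolicy: IfNotPresent
```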

@willgraf willgraf changed the title Set all custom charts image.pullPolicy to IfNotAvailable Set all custom charts image.pullPolicy to IfNotPresent Feb 11, 2020
@willgraf willgraf added the enhancement New feature or request label Feb 11, 2020
Contributor

@dylanbannon dylanbannon left a comment

I'm testing this right now and it appears that all cluster functionality is intact. So, that's good.

Beyond that, I want to confirm when we see a speedup from this change, and to what extent. (For testing, it's probably easiest to look at creation of Tensorflow-serving pods, since those use the largest Docker image of any pods we scale up and down.)

To this end, I think there are three independent cases to test:

  1. What speedup do we see when we try to create a second TF-serving pod on a node that already has one? Does the second pod download the Docker image, or does it use one already on the node? (With our standard node types, this situation would never occur, since we only have one GPU per node, so I don't expect to actually test it. It's listed here only for completeness.)
  2. What speedup do we see when a second concurrent node tries to start a TF-serving pod? If there's already one running node with TF-serving, does a second running node starting a TF-serving pod download its image from Docker Hub, or does it download it from the other node in the cluster?
  3. What speedup do we see when the cluster previously had a running TF-serving pod, but the node that hosted that pod has shut down, and a new node is now starting up and trying to run a TF-serving pod? Does the new node download the TF-serving image from Docker Hub, or can it copy it from some Kubernetes-level "image pool" within the cluster?

Tbh, it'd probably be sufficient to just read the docs to generate hypotheses for each of these cases first. Then, we could just test the ones that should show a performance improvement.

@willgraf
Contributor Author

It looks like each new node will need to re-pull the tf-serving image. This means that the performance of our benchmarking will likely not be greatly improved.

However, when a tf-serving pod crashes, it should be able to come back up faster?

@dylanbannon dylanbannon changed the base branch from master to stable March 3, 2020 22:53
@dylanbannon
Contributor

@willgraf

It looks like each new node will need to re-pull the tf-serving image. This means that the performance of our benchmarking will likely not be greatly improved.

However, when a tf-serving pod crashes, it should be able to come back up faster?

Yeah, I bet you're right about that.

I'm actually comfortable working off the assumption that every node maintains its own pool of downloaded images, so that the second time something happens on a node, it'll be faster. Otherwise, though, I don't think we'll see any speedup, since I'm guessing nodes can't share images directly, so new nodes will always download images from Docker Hub or the Google Container Registry or somewhere else outside the cluster.

I really don't feel the need for testing this, tbh.
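
For context, the chart value above ends up as the imagePullPolicy field on the rendered container spec, which is what the kubelet consults against its node-local image cache; a minimal illustrative fragment (names are hypothetical):

```yaml
# Fragment of a rendered Deployment manifest (hypothetical names)
spec:
  containers:
    - name: tf-serving
      image: example/tf-serving:0.1.0   # hypothetical image reference
      # With IfNotPresent, the kubelet checks the node's local image cache
      # first and only contacts the registry on a cache miss, which is why
      # only repeat starts on the same node get faster.
      imagePullPolicy: IfNotPresent
```

This matches the reasoning above: the cache is per node, so a brand-new node still has to pull from the registry.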

Contributor

@dylanbannon dylanbannon left a comment

This all looks good. We'll see speedups in certain situations and I can't think of any situation where this would have a downside.

@willgraf willgraf merged commit 3ea79ca into stable Mar 5, 2020
@willgraf willgraf deleted the willgraf/chart-update branch March 5, 2020 18:27
willgraf added a commit that referenced this pull request May 22, 2020
* Feature/cicd (#268)

* Set all custom charts image.pullPolicy to IfNotPresent (#258)

* setting TRANSLATE_COLON_NOTATION=false by default (#289)

* Update Getting Started  (#287)

* Update PULL_REQUEST.md for grammar (#292)

* Use gomplate to template patches/hpa.yaml. (#293)

* default account has 100 firewalls, not 200. (#297)

* Update all documentation and links to reference kiosk-console instead of kiosk (#295)

* Use yq and helmfile build to dynamically deploy helm charts based on release name. (#300)

* Upgrade the openvpn chart to latest 4.2.1. (#301)

* Change CLUSTER in Makefile to kiosk-console to fix binary name issue. (#302)

* update raw.gif and tracked.gif with new nearly perfect gif (#303)

* Update default values for tf-serving (#306)

* Update Redis to the latest helm chart before they migrate to bitnami (#307)

* Update autoscaler to 0.4.1 (#308)

* Update redis-janitor to 0.3.1 (#309)

* Update frontend to 0.4.1. (#310)

* Update OpenVPN command for version 4.2.1 (#313)

* Upgrade consumers to 0.5.1 and update models to DeepWatershed. (#311)

* Set no-appendfsync-on-rewrite=yes to prevent Redis latency issues during AOF fsync (#316)

* Install yq in install_script.sh (#319)

* Use 4 random digits for cluster names. (#318)

* update to latest version of the frontend (#322)

* Change default consumer machine type to n1-standard-2 (#323)

* Upgrade benchmarking to 0.2.4 and fix for Deep Watershed models (#324)

* Use GRAFANA_PASSWORD env var to override the default grafana password. (#325)

* Update Getting Started docs with new user feedback (#321)

* Add basic unit tests (#326)

* Use the docker container to run integration tests. (#327)

* Warn users if bucket's region and cluster's region do not match (#329)

* Bump benchmarking to latest 0.2.5 release (#331)

* Add Logo Banner and Update README (#332)

* Add new menu option for default settings with 4 GPUs (#333)

* Update HPA target to 2 keys per zip consumer pod. (#334)

* Bump consumers to version 0.5.2 (#336)

* Update consumer and benchmarking versions (#337)

* Bump redis-janitor to 0.3.2 to fix empty key bug. (#339)

* bump benchmarking to 0.3.1 to fix No route to host bug. (#341)

* Allow users to select which zone(s) to deploy the cluster (#340)

* Pin KUBERNETES_VERSION to 1.14. (#346)

* Fix bug indexing into last array element of valid_zones. (#348)

* Fix logs to indicate finality and be less redundant. (#351)

* If KUBERNETES_VERSION is 1.14, warn user of potential future version removal (#352)

Co-authored-by: dylanbannon <[email protected]>
Co-authored-by: MekWarrior <[email protected]>
willgraf added a commit that referenced this pull request May 23, 2020
* remove chart quotes and set all image.pullPolicy to IfNotPresent

* remove pullPolicy from helmfile, no need to override by default
willgraf added a commit that referenced this pull request May 23, 2020
* remove chart quotes and set all image.pullPolicy to IfNotPresent

* remove pullPolicy from helmfile, no need to override by default
willgraf added a commit that referenced this pull request May 23, 2020
* remove chart quotes and set all image.pullPolicy to IfNotPresent

* remove pullPolicy from helmfile, no need to override by default