Set all custom charts image.pullPolicy to IfNotPresent #258
Conversation
I'm testing this right now and it appears that all cluster functionality is intact. So, that's good.
Beyond that, I want to confirm when we see a speedup from this change, and to what extent. (For testing, it's probably easiest to look at creation of Tensorflow-serving pods, since those use the largest Docker image of any pods we scale up and down.)
To this end, I think there are three independent cases to test:
- What speedup do we see when we try to create a second TF-serving pod on a node that already has one? Does the second pod download the Docker image, or does it use one already on the node? (With our standard node types, this situation would never occur, since we only have one GPU per node, so I don't expect to actually test it. It's listed here only for completeness.)
- What speedup do we see when a second concurrent node tries to start a TF-serving pod? If there's already one running node with TF-serving, does a second running node starting a TF-serving pod download its image from Docker Hub, or does it download it from the other node in the cluster?
- What speedup do we see when the cluster previously had a running TF-serving pod, but the node that hosted that pod has shut down, and now a new node is starting up and trying to run a TF-serving pod? Does the new node download the TF-serving image from Docker Hub, or can it copy it from some Kubernetes-level "image pool" within the cluster?
Tbh, it'd probably be sufficient to just rtfd to generate hypotheses for each of these cases first. Then, we could just test the ones that should show a performance improvement.
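For context on these cases, the behavior in question is set per container via imagePullPolicy in the rendered pod spec. Here's a minimal sketch, assuming placeholder names and image tags rather than the actual chart output:

```yaml
# Hypothetical fragment of a rendered TF-serving pod spec; the pod name,
# container name, and image tag are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: tf-serving-example
spec:
  containers:
    - name: tf-serving
      image: example/tf-serving:0.1.0
      # Always       -> the kubelet contacts the registry on every pod start
      # IfNotPresent -> the kubelet reuses an image already cached on that node
      # Either way, the image cache is per node; nodes do not serve images to
      # each other, so a brand-new node still pulls from the registry.
      imagePullPolicy: IfNotPresent
```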
It looks like each new node will need to re-pull the […]. However, when a […]
Yeah, I bet you're right about that. I'm actually comfortable working off the assumption that every node maintains its own pool of downloaded images, so the second time a node needs a given image, it'll be faster. Otherwise, though, I don't think we'll see any speedup, since I'm guessing nodes can't share images directly, so new nodes will always download images from Docker Hub or the Google Container Registry or somewhere else outside the cluster. I really don't feel the need for testing this, tbh.
This all looks good. We'll see speedups in certain situations and I can't think of any situation where this would have a downside.
* Feature/cicd (#268)
* Set all custom charts image.pullPolicy to IfNotPresent (#258)
* setting TRANSLATE_COLON_NOTATION=false by default (#289)
* Update Getting Started (#287)
* Update PULL_REQUEST.md for grammar (#292)
* Use gomplate to template patches/hpa.yaml. (#293)
* default account has 100 firewalls, not 200. (#297)
* Update all documentation and links to reference kiosk-console instead of kiosk (#295)
* Use yq and helmfile build to dynamically deploy helm charts based on release name. (#300)
* Upgrade the openvpn chart to latest 4.2.1. (#301)
* Change CLUSTER in Makefile to kiosk-console to fix binary name issue. (#302)
* update raw.gif and tracked.gif with new nearly perfect gif (#303)
* Update default values for tf-serving (#306)
* Update Redis to the latest helm chart before they migrate to bitnami (#307)
* Update autoscaler to 0.4.1 (#308)
* Update redis-janitor to 0.3.1 (#309)
* Update frontend to 0.4.1. (#310)
* Update OpenVPN command for version 4.2.1 (#313)
* Upgrade consumers to 0.5.1 and update models to DeepWatershed. (#311)
* Set no-appendfsync-on-rewrite=yes to prevent Redis latency issues during AOF fsync (#316)
* Install yq in install_script.sh (#319)
* Use 4 random digits for cluster names. (#318)
* update to latest version of the frontend (#322)
* Change default consumer machine type to n1-standard-2 (#323)
* Upgrade benchmarking to 0.2.4 and fix for Deep Watershed models (#324)
* Use GRAFANA_PASSWORD env var to override the default grafana password. (#325)
* Update Getting Started docs with new user feedback (#321)
* Add basic unit tests (#326)
* Use the docker container to run integration tests. (#327)
* Warn users if bucket's region and cluster's region do not match (#329)
* Bump benchmarking to latest 0.2.5 release (#331)
* Add Logo Banner and Update README (#332)
* Add new menu option for default settings with 4 GPUs (#333)
* Update HPA target to 2 keys per zip consumer pod. (#334)
* Bump consumers to version 0.5.2 (#336)
* Update consumer and benchmarking versions (#337)
* Bump redis-janitor to 0.3.2 to fix empty key bug. (#339)
* bump benchmarking to 0.3.1 to fix No route to host bug. (#341)
* Allow users to select which zone(s) to deploy the cluster (#340)
* Pin KUBERNETES_VERSION to 1.14. (#346)
* Fix bug indexing into last array element of valid_zones. (#348)
* Fix logs to indicate finality and be less redundant. (#351)
* If KUBERNETES_VERSION is 1.14, warn user of potential future version removal (#352)

Co-authored-by: dylanbannon <[email protected]>
Co-authored-by: MekWarrior <[email protected]>
* remove chart quotes and set all image.pullPolicy to IfNotPresent
* remove pullPolicy from helmfile, no need to override by default
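For illustration, the second commit drops per-release overrides along these lines from the helmfile; the release name and chart path below are assumptions, not the repository's actual entries:

```yaml
# Sketch of a helmfile.yaml release entry (release name and chart path are
# placeholders). With the chart's own values.yaml now defaulting
# pullPolicy to IfNotPresent, this inline override is redundant.
releases:
  - name: tf-serving
    chart: ./charts/tf-serving
    values:
      - image:
          pullPolicy: IfNotPresent
```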
There is no need to pull an image for a container if that image already exists on the node. This should reduce the deployment time of new pods.
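In chart terms, the change amounts to something like the following in each custom chart's values file (a sketch only; the path, repository, and tag shown are placeholders):

```yaml
# charts/<some-chart>/values.yaml (placeholder path, repository, and tag)
image:
  repository: example/frontend
  tag: 0.4.1
  # IfNotPresent lets the kubelet reuse an image already cached on the node
  # instead of contacting the registry on every pod start.
  pullPolicy: IfNotPresent
```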