[RFC] Add default k8s resources for init containers #4552
Conversation
This commit adds resource requests and limits for CPU and memory to Tekton-controlled init containers. Prior to this commit, if a ResourceQuota was defined in a namespace, Tekton could not be used in that namespace because not all containers in the generated pods had CPU and memory requests and limits.

This commit does not affect Tekton's handling of LimitRange. Tekton will continue to transform container resource requests and limits to fit LimitRanges.

**What do these init containers do?**

- "place-tools" copies the entrypoint binary into each step container.
- "place-scripts" encodes each step's script and writes it to the container that will run it. We do not limit the size of scripts.
- "step-init" initializes /tekton/steps with one file per step.
- "working-dir-initializer" creates a working dir for each step that needs it.

**Where do these resource requirements come from?**

- The latest entrypoint binary is about 80MB. Previous builds were closer to 40-50MB.
- CPU requirements, and memory requirements for non-entrypoint containers, are guesses backed by some simple local experimentation. (It's challenging to get metrics for resource usage of Tekton init containers, and we'd need some sort of performance or load tests to really get a good estimate.)

**Side effects**

- Tekton pods now have "burstable" quality of service by default instead of "best effort", meaning they will be more likely to be scheduled and less likely to be evicted. It's no longer possible to create a Tekton pod with "best effort" quality of service.

**Drawbacks**

- If we update the entrypoint code and the binary changes size, the resource requirements will need to be adjusted. Because Tekton releases include building the entrypoint binary, this could break a release and we'd have to backport a fix.
- If a pod is OOMKilled due to exceeding its resource limits, the user will not be able to work around the problem. This can be mitigated by setting the init container's resource requirements higher than what it likely needs.

**Alternatives**

- We could apply resource requirements from a Task's StepTemplate to init containers. However, the StepTemplate is meant to be used for Steps, which likely have higher resource requirements than Tekton's init containers, and this would require users to understand what Tekton's init containers do.
- We could try to set resource requirements dynamically based on the number of Steps, the size of scripts, or other parameters.
- We could apply resource requirements to init containers only when ResourceQuotas are present in the namespace.

**Future work**

- Combine Tekton's init containers into one init container.

**References**

- Resource quota docs: https://kubernetes.io/docs/concepts/policy/resource-quotas/
- Pod quality of service docs: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
- Tekton entrypoint images: https://pantheon.corp.google.com/gcr/images/tekton-releases/us/github.com/tektoncd/pipeline/cmd/entrypoint?project=tekton-releases
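To make the change concrete, here is a hedged sketch of what the commit adds to each Tekton-controlled init container so that ResourceQuota admission passes. The quantities below are illustrative placeholders, not the exact values from the commit; the only constraint the description states is that the memory figure for the entrypoint must comfortably exceed the ~80MB binary size:

```yaml
# Fragment of a Tekton-generated Pod spec (values illustrative only)
initContainers:
  - name: place-tools
    resources:
      requests:
        cpu: 50m        # guess; no reliable metrics exist for these containers
        memory: 100Mi   # must exceed the ~80MB entrypoint binary
      limits:
        cpu: 100m
        memory: 200Mi
```

With requests and limits set on every container, a namespace ResourceQuota that constrains `requests.cpu`/`requests.memory` no longer rejects Tekton pods outright.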
[APPROVALNOTIFIER] This PR is NOT APPROVED.
FYI @pritidesai @guillaumerose This is an example of a change where performance benchmarking would be really useful!

@vdemeester would love to hear your thoughts here
@lbernick: The following tests failed.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for the thorough PR description!
This side effect seems to be a perfect candidate to be included in the release notes and probably in the docs somewhere?
This feels like something we should document (at least in a comment in the code).
We could make these values configurable via a platform-wide config map (possibly one of the existing ones).
By default they would be "burstable", and if you have a LimitRange in your namespace (or if users set requests/limits themselves), they would become "Guaranteed". I think it's close to the 2nd point of the drawbacks (below), but this means users might/will have to take these values into account when setting the requests/limits on their Steps in order to keep the Pod schedulable.
Setting it higher can also render the Pod unschedulable anywhere. I think this is one reason for making those requests/limits configurable. Another drawback, at least as the PR stands, is that it could create Pods that are invalid from the LimitRange standpoint: if the values we pick for the init containers are too low or too high, LimitRange validation will fail and the Pod won't be schedulable. In addition to making them configurable by the user, this means we should also "adapt" those values (requests/limits on init containers) based on the LimitRange if present, i.e. make sure they fall within the "allowed" range.
The third point could indeed be something. I think it might be a mix of several of those. We should give the user a way to specify default values, and also, most likely, a "mode" of operation (like "set only if ResourceQuotas or LimitRanges are present", "set always", "set never", …).
In theory this is less important, I think: init containers run sequentially, and the pod's effective init request is the maximum across them, so putting everything in one init container shouldn't change the accounting too much (but this is all theory though 😛). The tricky part of merging them all into one init container is that it might make the requests/limits values harder to decide. Overall, I think we should at least give users some options for whether they want/need requests/limits on the init containers (and most likely their values, with us providing a decent default) 👼🏼 👍🏼
Thanks for the feedback @vdemeester @afrittoli!
Added to release notes!
I think for guaranteed QoS you need request = limit, and my reading of our code/docs on limit range is that we set the highest limits and lowest requests allowed by the limit range.
Happy to change the limit range logic for init containers as part of this PR! I don't think we provide a way for users to specify init containers so all the init containers should be controlled by Tekton.
Good point, added to "drawbacks".
I agree: even if we have only one init container, we will still run into the same problems here. TBH, I don't think this PR should be merged, at least until we figure out a way around some of these drawbacks. The resource requirements I've used here are mostly guesses; do you have any suggestions on how we could get better estimates? It sounds like both of you see some value in allowing the user to configure some of this behavior. What would this configuration look like ideally?
@wlynch suggested that we could pull these default values from a configmap, and that users could modify the configmap if there are any issues with the defaults provided. The idea here would be that most users don't have to interact with this configmap, and would only use it if the defaults caused a problem or if they wanted to do something more specific, such as guaranteed QoS for pods. Going to close this PR as I'm not actively working on it, but it can easily be reopened.
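As a sketch of that configmap idea: the map name, keys, and mode values below are hypothetical (echoing the "mode of operation" suggestion in the discussion), not an implemented Tekton API:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-init-container-resources   # hypothetical name
  namespace: tekton-pipelines
data:
  # Hypothetical modes: "set-always", "set-never",
  # "set-if-quota-present" (only when a ResourceQuota or
  # LimitRange exists in the namespace).
  mode: "set-always"
  default-cpu-request: "50m"
  default-memory-request: "100Mi"
  default-cpu-limit: "100m"
  default-memory-limit: "200Mi"
```

Most users would never touch this map; operators whose quotas or binary sizes differ from the defaults could override it cluster-wide.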
Changes

Closes #2933.
/kind bug
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
functionality, content, code)
Release Notes