
[RFC] Add default k8s resources for init containers #4552

Closed
lbernick wants to merge 1 commit

Conversation

@lbernick (Member) commented Feb 8, 2022

Changes

This commit adds resource requests and limits for CPU and memory to Tekton-controlled init containers.
Prior to this commit, if a ResourceQuota was defined in a namespace, Tekton could not be used in that namespace
because not all containers in the generated pods had CPU and memory requests and limits. Closes #2933.

This commit does not affect Tekton's handling of LimitRange. Tekton will continue to transform container
resource requests and limits to fit LimitRanges.

What do these init containers do?

  • "place-tools" runs the entrypoint binary and writes it into each step container
  • "place-scripts" encodes each step's script and writes it to the container that will run it.
    We do not limit the size of scripts.
  • "step-init" initializes /tekton/steps with one file per step
  • "working-dir-initializer" creates a working dir for each step that needs it

Where do these resource requirements come from?

  • The latest entrypoint binary is about 80MB. Previous builds were closer to 40-50MB.
  • CPU requirements, and memory requirements for non-entrypoint containers, are guesses informed by some simple local experimentation.
    (It's challenging to get metrics for resource usage of Tekton init containers, and we'd need
    some sort of performance or load tests to really get a good estimate.)
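
To make this concrete, here is a rough Go sketch of what applying such defaults to the Tekton-controlled init containers could look like. It uses the standard Kubernetes API types, but the quantities and the helper below are illustrative assumptions, not the exact values or code in this commit.

```go
package pod

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// defaultInitResources holds hypothetical defaults, sized so the ~80MB
// entrypoint binary fits comfortably in memory. The numbers are assumptions
// for illustration only.
var defaultInitResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("50m"),
		corev1.ResourceMemory: resource.MustParse("100Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("100m"),
		corev1.ResourceMemory: resource.MustParse("200Mi"),
	},
}

// applyDefaultInitResources fills in requests and limits for any
// Tekton-controlled init container that doesn't already have them,
// so every container in the generated pod satisfies a ResourceQuota.
func applyDefaultInitResources(initContainers []corev1.Container) {
	for i := range initContainers {
		if initContainers[i].Resources.Requests == nil && initContainers[i].Resources.Limits == nil {
			initContainers[i].Resources = *defaultInitResources.DeepCopy()
		}
	}
}
```

Since Tekton generates all of these init containers itself (users can't add their own), a helper like this could run unconditionally when the pod is built.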

Side effects

  • Tekton pods now have "burstable" quality of service by default instead of "best effort", meaning they will
    be more likely to be scheduled and less likely to be evicted.
    It's no longer possible to create a Tekton pod with "best effort" quality of service.

Drawbacks

  • If we update the entrypoint code and the binary changes size, the resource requirements will need to be adjusted.
    Because Tekton releases include building the entrypoint binary, this could break a release and we'd have to backport a fix.
  • If a pod is OOMKilled due to exceeding its resource limits, the user will not be able to work around this problem.
    This can be mitigated by setting the init container's resource requirements higher than what it likely needs.
  • If resource requirements are set too high, the pod may not be schedulable anywhere.

Alternatives

  • We could apply resource requirements from a Task's StepTemplate to init containers. However, the StepTemplate
    is meant to be used for Steps, which likely have higher resource requirements than Tekton's init containers,
    and this would require users to understand what Tekton's init containers do.
  • We could try to set resource requirements dynamically based on the number of Steps, the size of scripts, or other parameters.
  • We could apply resource requirements to init containers only when ResourceQuotas are present in the namespace.

Future work

  • Combine Tekton init containers into one init container

References

  • Resource quota docs: https://kubernetes.io/docs/concepts/policy/resource-quotas/
  • Pod quality of service docs: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
  • Tekton entrypoint images: https://pantheon.corp.google.com/gcr/images/tekton-releases/us/github.com/tektoncd/pipeline/cmd/entrypoint?project=tekton-releases

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

[Bug fix]: apply resource requests to init containers, allowing Tekton to be used in namespaces with ResourceQuotas.
[Side effect]: Pods will now have "burstable" QoS by default, instead of "best effort".

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Feb 8, 2022
@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign imjasonh
You can assign the PR to them by writing /assign @imjasonh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 8, 2022
@lbernick (Member, Author) commented Feb 8, 2022

FYI @pritidesai @guillaumerose This is an example of a change where performance benchmarking would be really useful!

@lbernick (Member, Author) commented Feb 8, 2022

@vdemeester would love to hear your thoughts here

@tekton-robot (Collaborator)

@lbernick: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-tekton-pipeline-unit-tests 050ff18 link /test tekton-pipeline-unit-tests
pull-tekton-pipeline-alpha-integration-tests 050ff18 link /test pull-tekton-pipeline-alpha-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


@afrittoli (Member) commented Feb 9, 2022

Thank you for the thorough PR description!

Side effects

  • Tekton pods now have "burstable" quality of service by default instead of "best effort", meaning they will
    be more likely to be scheduled and less likely to be evicted.
    It's no longer possible to create a Tekton pod with "best effort" quality of service.

This side effect seems to be a perfect candidate to be included in the release notes and probably in the docs somewhere?

Drawbacks

  • If we update the entrypoint code and the binary changes size, the resource requirements will need to be adjusted.

This feels like something we should document (at least in a comment in the code).

@afrittoli (Member)

If a pod is OOMkilled due to exceeding resource requirements, the user will not be able to work around this problem.
This can be mitigated by setting the init container's resource requirements to higher than what it likely needs.

We could make these values configurable via a platform wide config map (possibly one of the existing ones).
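
A minimal sketch of how that could look, assuming hypothetical key names in such a platform-wide config map (this is not an existing Tekton config map schema):

```go
package config

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// initResourcesFromConfigMap turns entries such as
//
//	default-init-container-cpu-request: "50m"
//	default-init-container-memory-limit: "200Mi"
//
// into a ResourceRequirements value, skipping keys that are absent.
// The key names are hypothetical.
func initResourcesFromConfigMap(data map[string]string) (corev1.ResourceRequirements, error) {
	out := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{},
		Limits:   corev1.ResourceList{},
	}
	for _, e := range []struct {
		list corev1.ResourceList
		name corev1.ResourceName
		key  string
	}{
		{out.Requests, corev1.ResourceCPU, "default-init-container-cpu-request"},
		{out.Requests, corev1.ResourceMemory, "default-init-container-memory-request"},
		{out.Limits, corev1.ResourceCPU, "default-init-container-cpu-limit"},
		{out.Limits, corev1.ResourceMemory, "default-init-container-memory-limit"},
	} {
		v, ok := data[e.key]
		if !ok {
			continue
		}
		q, err := resource.ParseQuantity(v)
		if err != nil {
			return corev1.ResourceRequirements{}, err
		}
		e.list[e.name] = q
	}
	return out, nil
}
```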

@vdemeester (Member) commented Feb 14, 2022

Side effects

  • Tekton pods now have "burstable" quality of service by default instead of "best effort", meaning they will
    be more likely to be scheduled and less likely to be evicted.
    It's no longer possible to create a Tekton pod with "best effort" quality of service.

By default they would be "burstable", and if you have a LimitRange in your namespace (or if users set requests/limits themselves) they would become "Guaranteed". I think this is close to the 2nd point of the drawbacks below, but it means users might/will have to take these values into account when setting the requests/limits on their Steps so that the Pod stays schedulable.

Drawbacks

  • If we update the entrypoint code and the binary changes size, the resource requirements will need to be adjusted.
    Because Tekton releases include building the entrypoint binary, this could break a release and we'd have to backport a fix.
  • If a pod is OOMkilled due to exceeding resource requirements, the user will not be able to work around this problem.
    This can be mitigated by setting the init container's resource requirements to higher than what it likely needs.

Setting it higher can also render the Pod unschedulable on any node. I think this is one reason for making those requests/limits configurable.

Another drawback, at least as the PR is currently written, is that it could create Pods that are invalid from the LimitRange standpoint: if the values we pick for the init containers are too low or too high, LimitRange validation will fail and the Pod won't be schedulable. In addition to making these values configurable by the user, this means we should also "adapt" them (the requests/limits on init containers) based on the LimitRange if one is present, i.e. make sure they fall within the allowed range.
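
A rough sketch of that "adapt" step, assuming it means clamping the chosen values into the per-container Min/Max of any LimitRange present in the namespace (an assumption about the approach, not code from this PR):

```go
package pod

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// clampQuantity bounds q below by min and above by max, ignoring bounds
// that the LimitRange doesn't set (zero values).
func clampQuantity(q, min, max resource.Quantity) resource.Quantity {
	if !min.IsZero() && q.Cmp(min) < 0 {
		return min
	}
	if !max.IsZero() && q.Cmp(max) > 0 {
		return max
	}
	return q
}

// clampToLimitRange adjusts a container's CPU/memory requests and limits so
// that the per-container Min/Max constraints of a LimitRange don't reject
// the generated pod.
func clampToLimitRange(c *corev1.Container, lr corev1.LimitRange) {
	for _, item := range lr.Spec.Limits {
		if item.Type != corev1.LimitTypeContainer {
			continue
		}
		for _, name := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {
			if q, ok := c.Resources.Requests[name]; ok {
				c.Resources.Requests[name] = clampQuantity(q, item.Min[name], item.Max[name])
			}
			if q, ok := c.Resources.Limits[name]; ok {
				c.Resources.Limits[name] = clampQuantity(q, item.Min[name], item.Max[name])
			}
		}
	}
}
```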

Alternatives

  • We could apply resource requirements from a Task's StepTemplate to init containers. However, the StepTemplate
    is meant to be used for Steps, which likely have higher resource requirements than Tekton's init containers,
    and this would require users to understand what Tekton's init containers do.
  • We could try to set resource requirements dynamically based on the number of Steps, the size of scripts, or other parameters.
  • We could apply resource requirements to init containers only when ResourceQuotas are present in the namespace.

The third point could indeed be something. I think it might end up being a mix of several of those: we should give the user a way to specify default values, and also, most likely, a "mode" of operation (like "set only if ResourceQuotas or LimitRanges are present", "set always", "set never", …).
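
A minimal sketch of such a mode, with hypothetical mode names and helper:

```go
package pod

import corev1 "k8s.io/api/core/v1"

// initResourceMode controls when default requests/limits are applied to
// Tekton's init containers. The names are hypothetical.
type initResourceMode string

const (
	modeAlways        initResourceMode = "set-always"
	modeNever         initResourceMode = "set-never"
	modeIfConstrained initResourceMode = "set-if-quota-or-limitrange"
)

// shouldSetInitResources decides whether to apply the defaults, given the
// configured mode and what already exists in the namespace.
func shouldSetInitResources(mode initResourceMode, quotas []corev1.ResourceQuota, limitRanges []corev1.LimitRange) bool {
	switch mode {
	case modeAlways:
		return true
	case modeNever:
		return false
	case modeIfConstrained:
		return len(quotas) > 0 || len(limitRanges) > 0
	default:
		return false
	}
}
```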

Future work

In theory this is less important I think: init containers run one after the other, and the pod's effective request only counts the largest of them, so putting everything into one init container shouldn't change too much. But this is all theory though 😛. The catch with merging them all into one init container is that it might make the requests/limits values harder to decide.

Overall, I think we should at least give some options to the users on whether they want/need to have requests/limits on the init containers (and most likely their values, with us providing a decent default) 👼🏼 👍🏼

@lbernick (Member, Author) commented Feb 14, 2022

Thanks for the feedback @vdemeester @afrittoli!

This side effect seems to be a perfect candidate to be included in the release notes and probably in the docs somewhere?

Added to release notes!

By default they would be "burstable" and if you have LimitRange in your namespace, they would become "Guaranteed"

I think for guaranteed QoS you need request = limit, and my reading of our code/docs on limit range is that we set the highest limits and lowest requests allowed by the limit range.

Another drawback, at least as the PR is done, is that, it could create Pod that are invalid from the LimitRange standpoint.

Happy to change the limit range logic for init containers as part of this PR! I don't think we provide a way for users to specify init containers so all the init containers should be controlled by Tekton.

Setting it higher can also render the Pod unschedulable anywhere. I think this is one reason for making those requests/limits configurable.

Good point, added to "drawbacks".

In theory this is less important I think.

I agree, even if we have only one init container we will still run into the same problems here.

TBH, I don't think this PR should be merged, at least until we figure out a way around some of these drawbacks. The resource requirements I've used here are mostly guesses; do you have any suggestions on how we could get better estimates?

It sounds like both of you see some value in allowing the user to configure some of this behavior. What would this configuration look like ideally?

@lbernick (Member, Author)

@wlynch suggested that we could pull these default values from a configmap, and that users could modify the configmap if there are any issues with the defaults provided. The idea here would be that most users don't have to interact with this configmap, and would only use it if the defaults caused a problem or if they wanted to do something more specific such as a guaranteed qos for pods.

Going to close this PR as I'm not actively working on it but this can easily be reopened.

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resource request not applied to init and sidecar containers
4 participants