Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sidecar: Address all concerns raised #1980

Merged
merged 16 commits into from
Oct 21, 2020

Conversation

rata
Copy link
Member

@rata rata commented Sep 11, 2020

Hi!

As we talked on the last SIG-node, here it goes a PR solving all the concerns with the proposed suggestion previously outlined (in some cases something simpler, as suggested in the last SIG-node call too :)).

The complete diff looks big and is hard to follow (I thought about using smaller PRs, but got many git conflicts). I tried to split on different logical changes so it can be easily reviewed. If that way of reviewing is not of your preference, another way to see the final result easily is also using the "show this file" github option. If you want me to split the changes in different PRs anyways, just let me know and I can maybe split this in more than one PR to open secuentially :)

Thanks again for your time and patience :)

cc @Joseph-Irving

rata added 15 commits September 10, 2020 15:38
This was requested by Sergey here:
	kubernetes#1913 (comment)

Signed-off-by: Rodrigo Campos <[email protected]>
One decision to revisit was to see if really wanted to change the pod
phase. It was clearly agreed here that we won't:
	kubernetes#1913 (comment)

This commit just removes the section to discuss that (it is already
agreed) and updates the KEP to reflect that.

Signed-off-by: Rodrigo Campos <[email protected]>
This was added due to a comment from Istio long ago[1], but they don't
need this anymore[2]. Furthermore, our use cases at Kinvolk also work
just fine without this.

[1]: kubernetes/community#2148 (comment)
[2]: kubernetes#1913 (comment)

Signed-off-by: Rodrigo Campos <[email protected]>
This commit just move some section to caveats with a clarification on
why it seems enough to do this on each.

Signed-off-by: Rodrigo Campos <[email protected]>
These concerns have been discussed:
 * "No container type standard". As pointed, this was already discussed
   and agreed in the past.
 * "Is this change really worth doing?". It seems it is, given the
   community response and the problems stated there can only be solved
   at the Kubernetes layer
 * "Why this design seems extensible?". Didn't present any major concerns
   and it does seem extensible.
 * "Sidecar containers won't be available during initContainers phase".
   We had a VC with Sergey and Joseph and agreed that it is out of scope
   for this KEP.

Signed-off-by: Rodrigo Campos <[email protected]>
As discussed in SIG-node on 08.09.2020 and suggested by Sergey, we can
just document the caveat that it might take 4s longer to kill a pod with
sidecar containers.

This is very simple, even simpler than the alternatives listed.

Signed-off-by: Rodrigo Campos <[email protected]>
This was suggested by Sergey here:
	kubernetes#1913 (comment)

Sadly, that branch is merged and can't click on commit suggestion now.
Credits goes to Sergey anyways :-)

Signed-off-by: Rodrigo Campos <[email protected]>
In the previous PR, istio devs commented that some things were not
accurate. This commit just updates the text to (hopefully) correctly
reflect it now.

Removed the paragraph about this removing the need for an initContainer
due to comment here:
	kubernetes#1913 (comment)

I thought it was an okay to insert the iptables rules within the sidecar
proxy container, but it is not okay as that requires more permissions
(capabilities) on the sidecar proxy container which is not considered
accetable by Istio devs.

Signed-off-by: Rodrigo Campos <[email protected]>
Signed-off-by: Rodrigo Campos <[email protected]>
After the last changes to the KEP, this doesn't apply anymore: the
podPhase is not modified.

Furthermore, the preStop hooks are sent once but the PoC wasn't updated
to reflect that. To avoid confussions, call it out there.

Signed-off-by: Rodrigo Campos <[email protected]>
There were two sections with caveats in their name otherwise. Also, this
is more similar to the latest KEP template.

The diff looks tricky just because the text is being moved 2 sections
below. However, git creates a diff showing the unchanged section is
moved up instead.

Signed-off-by: Rodrigo Campos <[email protected]>
Signed-off-by: Rodrigo Campos <[email protected]>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 11, 2020
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 11, 2020
@rata
Copy link
Member Author

rata commented Sep 11, 2020

Oh, I thought the bot will assign to @SergeyKanzhelev and @sjenning too. Sorry, let me try with the cc command now.

/cc @SergeyKanzhelev @sjenning

This was requested by Tim some months ago:
	kubernetes/kubernetes#79649 (comment)

I didn't get that it should be removed, I just thought it should not be
set by default. But Joseph Irving just clarified that for me:
	kubernetes#1980 (comment)

Signed-off-by: Rodrigo Campos <[email protected]>
@kikisdeliveryservice kikisdeliveryservice mentioned this pull request Sep 12, 2020
11 tasks
Copy link

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

In other words, Istio has an alternative to configure the traffic blockhole
without an initContainer. But the other problems and hacks mentioned remain,
though.
Such rules are usually inserted as an initContainer (trying to run last, to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that may be important is that no matter what ordering we do, its still broken to some extent. While you may be able to bypass the sidecar, you lose all it gives you. This includes routing rules (may see unexpected traffic), telemetry (no visibility) and typically/ideally, when using a service mesh all pods will enforce mTLS, so you will not be able to communicate with any pods in the mesh.

I know this is out of scope but I just wanted to clarify the workarounds are not sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, good point, thanks! I'll see how to clarify the text to highlight that those are the workarounds in use today, but they are not enough.

But sure, it is out of scope for this KEP :)

@derekwaynecarr
Copy link
Member

I want to capture my latest thoughts on sidecar concepts, and get a path forward.

Here is my latest thinking:

I think it’s important to ask if the introduction of sidecar containers will actually address an end-user requirement or just shift a problem and further constrain adoption of sidecars themselves by pod authors. To help frame this exercise, I will look at the proposed use of sidecar containers in the service mesh community.

User story
I want to enable mTLS for all traffic in my mesh because my auditor demands it.

The proposed solution is the introduction of sidecar containers that change the pod lifecycle:

  1. Init containers start/stop
  2. Sidecar containers start
  3. Primary containers start/stop
  4. Sidecar containers stop

The issue with the proposed solution meeting the user story is as follows:

  • Init containers are not subject to service mesh because the proxy is not running. This is because init containers run to completion before starting the next container. Many users do network interaction that should be subject to the mesh in their init container.

  • Sidecar containers (once introduced) will be used by users for use cases unrelated to the mesh, but subject to the mesh. The proposal makes no semantic guarantees on ordering among sidecars. Similar to init containers, this means sidecars are not guaranteed to participate in the mesh.

The real requirement is that the proxy container MUST stop last even among sidecars if those sidecars require network.

Similar to the behavior observed with init containers (users externalize run-once setup from their main application container), the introduction of sidecar containers will result in more elements of the application getting externalized into sidecars, but those elements will still desire to be part of the mesh when they require a network. Hence, we are just shifting, and not solving the problem.

Given the above gaps, I feel we are not actually solving a primary requirement that would drive improved adoption of a service mesh (ensure all traffic is mTLS from my pod) to meet auditing.

Alternative proposal:

  • Support an ordered graph among containers in the pod (it’s inevitable), possibly with N rings of runlevels?
  • Identify which containers in that graph must run to completion before initiating termination (Job use case).
  • Move init containers into the graph (collapse the concept)
  • Have some way to express if a network is required by the container to act as a hint for the mesh community on where to inject a proxy in the graph.

A few other notes based on Red Hat's experience with service mesh:

Red Hat does not support injection of privileged sidecar containers and will always require CNI approach. In this flow, the CNI runs, multus runs, iptables are setup, and then init containers start. The iptables rules are setup, but no proxy is running, so init containers lose connectivity. Users are unhappy that init containers are not participating in the mesh. Users should not have to sacrifice usage of an init container (or any aspect of the pod lifecycle) to fulfill auditor requirements. The API should be flexible enough to support graceful introduction in the right level of a intra pod life-cycle graph transparent to the user.

Proposed next steps:

  • Get a dedicated set of working meetings to ensure that across the mesh and kubernetes community, we can meet a users auditing requirement without limiting usage or adoption of init containers and/or sidecar containers themselves by pod authors.
  • I will send a doodle.

/cc @thockin @smarterclayton @dchen1107 @sjenning @liggitt given past involvement in this space.

@mrunalp
Copy link
Contributor

mrunalp commented Sep 24, 2020

Support an ordered graph among containers in the pod (it’s inevitable), possibly with N rings of runlevels?

Agree! We might as well tackle this general problem vs. doing it step by step with baggage added along the way.

@sjenning
Copy link
Contributor

sjenning commented Sep 24, 2020

I agree @derekwaynecarr

I think that in order to satisfy fully the use cases mentioned, we are gravitating toward systemd level semantics where there is just an ordered graph of services containers in the pod spec.

You could basically collapse init containers into the normal containers map and add two fields to Container; oneshot bool that expresses if the container terminates and dependent containers should wait for it to terminate (handles init containers w/ ordering), and requires map[string] a list of container names upon which the current container depends.

This is flexible enough to accommodate a oneshot: true container (init container) depending on a oneshot: false container (a proxy container on which the init container depends).

Admittedly this would be quite the undertaking and there is API compatibility to consider.

@thockin
Copy link
Member

thockin commented Sep 24, 2020

I have also been thinking about this. There are a number of open issues, feature-requests, etc that all circle around the topic of pod and container lifecycle. I've been a vocal opponent of complex API here, but it's clear that what we have is inadequate.

When we consider init-container x sidecar-container, it is clear we will inevitably eventually need an init-sidecar.

Some (non-exhaustive) of the other related topics:

  • Node shutdown -> Pod shutdown (in progress?)
  • Voluntary pod restart ("Something bad happened, please burn me down to the ground and start over")
  • Voluntary pod failure ("I know something you don't, and I can't run here - please terminate me and do not retry")
  • "Critical" or "Keystone" containers ("when this container exits, the rest should be stopped")
  • Startup/shutdown phases with well-defined semantics (e.g. "phase 0 has no network")
  • Mixed restart policies in a pod (e.g. helper container which runs and terminates)
  • Clearer interaction between pod, network, and device plugins

We need to make it possible for admission controls to inject sidecars without knowing about each other, but we also need to be careful how much we throw on end-users or cluster-admins. The solution is very unclear to me. We need to handle multiple uncoordinated sidecars that demand to be "first". We need to compose with loose coupling.

@thockin
Copy link
Member

thockin commented Sep 24, 2020

This is a big enough topic that we almost certainly need to explore multiple avenues before we can have confidence in any one.

Assigning to myself, but only so I can track email better :)

@thockin thockin self-assigned this Sep 24, 2020
@derekwaynecarr derekwaynecarr self-assigned this Sep 24, 2020
@kfox1111
Copy link

the dependency idea also would allow for doing an init container, then a sidecar network plugin, then more init containers, etc, which has some nice features.

Also the readyness checks and oneshot could all play together with the dependencies so the next steps aren't started before ready.

So, as a user experience, I think that api might be very nice.

Implementation wise there are probably lots of edge cases to carefully consider there.

@SergeyKanzhelev
Copy link
Member

@derekwaynecarr this is great idea to set up a working group to move it forward in bigger scope. One topic I suggest we cover early on in the discussions is whether we need to address the existing pain point of injecting sidecars in jobs in 1.20. This KEP intentionally limited the scope to just this - formalizing what people are already trying to do today with workarounds.

From Google side we also would love the bigger scope of a problem be addressed, but hope to address some immediate pain points early if possible. Either in current scope or slightly bigger.

@derekwaynecarr
Copy link
Member

@SergeyKanzhelev I would speculate that the dominant consumer of the job scenario is a job that required participation in a mesh to complete its task, and since I don't see much point in solving for the mesh use case (which I view as the primary motivator for defining side car semantics) for only one workload type, I would rather ensure a pattern that solves the problem in light of our common experience across mesh and k8s communities.

@rata
Copy link
Member Author

rata commented Sep 24, 2020

@derekwaynecarr to be honest, I'm doing this and for our use case that is not a problem. Our use cases for jobs are not just about service mesh, but for example containers that uploads files after the processing finished. For pods that run indefinitely, we have like 7-8 sidecars and only one is a service mesh and we don't need them during initContainers phase.

We might want that in the future, though. But is not the main use case for us now. This is why I think solving the problems for sidecar containers as we know them today (start after initContainers) has merit: solves race conditions on start/shutdown and allows jobs to use them. And as we talked on slack, if we decide to add other types later, this "type: sidecar" won't create a maintenance burden as it has a 1-1 matching with several semantics (like "DependsOn". If we add that, containers type sidecar just have a "DependsOn" containers type standard).

As I said, I'm fine working on any alternative. I don't think that service mesh is the only use case, in fact we don't need it and I'd rather have this in 1.20 than wait a long time for an extended scope, given that it doesn't present downsides if pod startup becomes a directed graph later. But I'll be happy to work on any alternative we agree on :-)

I'll send an email tomorrow (9pm here now :)) to coordinate a meeting, as we talked on slack: https://kubernetes.slack.com/archives/C0BP8PW9G/p1600966907009200.

Thanks again!

@derekwaynecarr
Copy link
Member

@rata we can discuss for sure. we each have unique use cases and needs. my personal opinion is incrementally designing an ordered init flow is going to cause future pain (like we now feel with init containers) versus just designing for flexible user-defined ordering. I think @sjenning proposal is worth contrasting to see if the cost for that approach is any different. Even if we do what was proposed here, we know the mesh community will not be satisfied, and so we will have another set of tweaks to just iterate to a generic outcome. Curious if others feel different.

@rata
Copy link
Member Author

rata commented Sep 25, 2020

Thanks! I update here too, as some assigned to get emails here :-)

I've just sent an email to have a meeting and coordinate how to proceed (here is the copy to the mail, I sent it to kubernetes-dev, sig-node and sig-arch):
https://groups.google.com/g/kubernetes-sig-node/c/w019G3R5VsQ/m/bbRDZTv5CAAJ

@rata
Copy link
Member Author

rata commented Oct 5, 2020

@derekwaynecarr please feel free to close or merge this PR.

I think merging is better because the KEP in master now has lot of open questions and this PR and for me is simpler to continue from the document after this PR (some things are cleaned-up, the sections modified in this PR are adapted to the new KEP format, etc.) as some things will need to be re-done if we don't merge.

Thanks again!

@Joseph-Irving
Copy link
Member

I think we should merge this but add a note saying this KEP has been rejected, it sounds like any way forward would involve a complete re-write, so instead we could just create new KEPs that reference this as a rejected alternative.

@derekwaynecarr
Copy link
Member

@rata @Joseph-Irving i agree merging the incremental updates makes sense for future reference.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 21, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, rata

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 21, 2020
@k8s-ci-robot k8s-ci-robot merged commit b7a365e into kubernetes:master Oct 21, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 21, 2020
@rata rata deleted the rata/sidecar-kep branch October 21, 2020 14:06
SergeyKanzhelev pushed a commit to SergeyKanzhelev/enhancements that referenced this pull request Jan 8, 2021
This was requested by Tim some months ago:
	kubernetes/kubernetes#79649 (comment)

I didn't get that it should be removed, I just thought it should not be
set by default. But Joseph Irving just clarified that for me:
	kubernetes#1980 (comment)

Signed-off-by: Rodrigo Campos <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants