Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add “cat” to base image - adds the ability to query missing cert file #4109

Closed
tshaiman opened this issue Jan 15, 2023 · 20 comments
Closed
Labels
feature-request All issues for new features that have not been committed to needs-discussion stale All issues that are marked as stale due to inactivity

Comments

@tshaiman
Copy link

Proposal

following the issue where azure identity webhook did not mutate keda pods during node replacement
we need a better health check mechanism to enforce pod restart in case the cert is missing

today we cannot do healthcheck based on linux cat command as it is not part of the distroless base image .

please consider adding cat to the distroless image :

—copy-from busybox:xxx /bin/cat /bin/cat

Use-Case

No response

Is this a feature you are interested in implementing yourself?

No

Anything else?

#3977

@tshaiman tshaiman added feature-request All issues for new features that have not been committed to needs-discussion labels Jan 15, 2023
@tomkerkhove
Copy link
Member

Thoughts @kedacore/keda-maintainers ?

@JorTurFer
Copy link
Member

I don't like the idea TBH, using cat an external user can get internal information, we use distroless image for reducing as much as possible the attack vector, introducing executables for edge cases doesn't make sense IMO, we should discuss how to execute a better probes based on the resources we are managing, but I wouldn't add any extra binary to the image, but I won't block this if other folks think that adding it is worth

@tshaiman
Copy link
Author

we need something .
empty healthy check and disability to use our own check whether the cert exists cause our usage of Keda with all the Workload Identity issues described in other tickets as non-production grade product .

@zroubalik
Copy link
Member

I agree with @JorTurFer on this, we should keep the image minimal as much as possible. @tshaiman but I think I can say that we are open to other suggestions.

@tshaiman
Copy link
Author

as mentioned in previous tickets - we need a reliable health check . you don’t agree to add logic to the health check in code , and you don’t to allow us to inject the health check file with cat
in other words , keda contributors rather the health probe to always return “healthy” even when it’s not .
i have no other suggestions than to change this policy which is totally wrong in terms of reliability ( reporting something is healthy while it’s not)

@zroubalik
Copy link
Member

you don’t agree to add logic to the health check in code

where we don't agree on this?

@JorTurFer
Copy link
Member

JorTurFer commented Jan 17, 2023

I have some points about that. KEDA works, providing the external variables is responsibility of a 3rd party. We only said that we don't like the idea of adding extra binaries to the image only for solving an edge case. I also said that adding these values checking as part of the KEDA probes is dangerous because in your use case it's critical, but other use cases don't require it and other users can prefer to fail only in 1 ScaledObject than failing all the time in case of persistent errors related with WI because if the trouble is not transitory, KEDA won't work because it and maybe there are 1 or 2 ScaledObjects with that dependency and a huge amount without it.

I'm not against adding a dynamic probes inside KEDA based for example in a new parameter in the TriggerAuthentication, or as a general parameter (maybe better as general).

But I'd like to add that IMHO this issue should be fixed (and can be, indeed) in 3rd party side. K8s webhooks have a property named failurePolicy with 2 possible options:

  • Ignore: The validation/mutation will continue without any error if the webhooks server isn't available (current case, producing the fail)
  • Fail: The validation/mutation will fail webhooks server isn't available (blocking KEDA scheduling without the needed resources)

IDK if the managed WI add-on supports it or not but helm does it. Setting that value to Fail with a proper namespaceSelector to avoid locks (because the webhooks server can't be scheduled with itself working) can fix the issue better than adding a probe to restart KEDA if WI fails

@tshaiman
Copy link
Author

@JorTurFer : I agree the root cause is with the failurePolicy and as far as I heard from WIF people, the fix is on its way.
my remark is more of a general one : in cloud native apps , you need to allow customer to inject their own health -checks
and if we can't do that , this is a bit of violation of the contract we have with k8s.
adding "cat" utility should not break the security barriers if we only add "cat" and not /bin/sh

@JorTurFer
Copy link
Member

JorTurFer commented Jan 17, 2023

AFAIK, cat can execute arbitrary code, if an attacker mounts a compromised file and read it with cat, they could execute arbitrary code in the pod, with all the privileges that the pod has (usually, the pod has privileged access to a lot of places).

WRT the option of injecting custom probes, most probably I'm wrong, but I have just checked some helm charts in artifact hub (the top 3) and these are the results:

  • kube-prometheus-stack: Doesn't support any change in probes
  • cert-manager: Only supports the configuration of the general values like timeout or threshold. probe type and path are locked
  • nginx: supports changes in all the values, except the type, it's always httpGet. (only for controller, default backend is restricted only to general values like timeout and threshold)

I have checked only the top 3, maybe other charts are more flexible about the probes, but even if we supported the most flexible case of them (any parameter except the probe type) , it wouldn't be enough for your use case as you need a change from httpGet to exec.

I still think that adding a parameter for enabling a check as part of the already existing health check is better. Maybe as a first approach, we can just set an environment variable (helm allows setting extra envs given by user) and if that env is true, we can add the check in the current health checks. This can give you the option to restart KEDA if you think that it's critical, and other users have the option to don't do it if they think the opposite.
Something like PROBE_AAD_IDENTITYI, PROBE_AWS_IDENTITY and PROBE_GCP_IDENTITY for adding those 3 checks conditionally to KEDA checks (Aad-Pod-Identity can't be tested in this way AFAIK because it doesn't inject anything, only federated authentications do it)

WDYT about this @kedacore/keda-core-contributors ?

@tomkerkhove
Copy link
Member

I can see why we want to stick with distroless image; but let's see how we can otherwise help with docs and guidance.

For health checks; I think it depends on what we want to achieve here and what defines "healthy". Maybe a side-car approach can be feasible.

@tshaiman
Copy link
Author

a side car cannot assist here as if it will detect the missing cert and report in healthy - that would cause the side car container to restart and not the keda container to restart

@tshaiman
Copy link
Author

keda unhealthy == it cannot authenticate / cannot access the token file

that i is good enough KPI in my opinion

@zroubalik
Copy link
Member

I think this check should be independent of KEDA deployment. The probes should inform about health condition of KEDA operator, not one of scalers or auth providers.

This way on a shared cluster, one user can bring KEDA down, just by using incorrect auth settings.

We should provide additional ways how to inform about this. Maybe a status on TriggerAuth? Or a new resource that will report the status?

@tomkerkhove
Copy link
Member

We should provide additional ways how to inform about this.

I think this is the sweet spot combined with #3764.

Then you can both alert based on logs + event on the TriggerAuthentication resource's events

@tomkerkhove
Copy link
Member

Actually this is already tracked in #3629

@zroubalik
Copy link
Member

@tshaiman are you willing to draft a proposal using approach mentioned above?

@stale
Copy link

stale bot commented Mar 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Mar 26, 2023
@stale
Copy link

stale bot commented Apr 2, 2023

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Apr 2, 2023
@github-project-automation github-project-automation bot moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Apr 2, 2023
@JorTurFer JorTurFer reopened this Apr 3, 2023
@github-project-automation github-project-automation bot moved this from Ready To Ship to Proposed in Roadmap - KEDA Core Apr 3, 2023
@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Apr 3, 2023
@stale
Copy link

stale bot commented Jun 8, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Jun 8, 2023
@stale
Copy link

stale bot commented Jun 15, 2023

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Jun 15, 2023
@github-project-automation github-project-automation bot moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request All issues for new features that have not been committed to needs-discussion stale All issues that are marked as stale due to inactivity
Projects
Archived in project
Development

No branches or pull requests

4 participants