Ability to debug nodes with running debug container #8720

kvaps · 2024-05-08T10:32:47Z

Feature request

Description

It would be nice to implement an API and a talosctl command for debugging.

It might be done the same way as kubectl debug:

# Create an interactive debugging session in pod mypod and immediately attach to it.
talosctl debug mypod -n 11.22.33.44 -e 11.22.33.44 -it --image=busybox -- /bin/sh

Or proxy the CRI socket the same way ssh-agent works:

 eval `talosctl proxy-cri -n 11.22.33.44 -e 11.22.33.44`

which outputs:

export CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/talosctl/38ab4953-994e-4517-bf4f-43ad8b5d2b38.sock

then use crictl to debug CRI containers:

# crictl ps | grep kube-apiserver
3ff4626a9f10e       e7972205b6614       6 hours ago         Running             kube-apiserver         0                   215107b47bd7e       kube-apiserver-talos-rzq-nkg

Or run a container:

# Run a container:
crictl run docker.io/library/busybox:latest

# Attach a shell to a running container:
crictl exec -it my-container sh
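
The ssh-agent-style handoff proposed above can be sketched in plain shell. This is only an illustration of the eval pattern: proxy_cri_stub is a hypothetical stand-in for the proposed talosctl proxy-cri command (which does not exist), and the socket path is made up.

```shell
#!/bin/sh
# Hypothetical stand-in for the proposed `talosctl proxy-cri` command.
# A real implementation would start a local proxy and print its socket path.
proxy_cri_stub() {
    sock="/var/run/talosctl/example.sock"   # made-up path, for illustration only
    printf 'export CONTAINER_RUNTIME_ENDPOINT=unix://%s\n' "$sock"
}

# eval runs the printed `export` line in the current shell,
# the same way `eval "$(ssh-agent -s)"` sets SSH_AUTH_SOCK.
eval "$(proxy_cri_stub)"

echo "$CONTAINER_RUNTIME_ENDPOINT"
```

Because the variable is exported, a crictl started from this shell would inherit it as its runtime endpoint.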


smira · 2024-05-08T16:02:00Z

Idea from planning:

allow talosctl to pull any image on the user machine and push (upload) it to Talos containerd image storage
make talosctl exec into that container image in privileged mode
once done, clean up everything

This allows the command to work even if Talos machine can't pull any image from the registry at the moment, and any custom image can be pushed.

Example: talosctl debug alpine:3.19

Talos APIs:

push an image to the containerd image store
remove an image (already have it?)
create a container
exec into it

Maintenance mode:

disabled by default
enabled via SideroLink
enabled if some kernel arg is set (for debugging)
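
The push / exec / clean-up flow from the planning notes above could look roughly like this. It is a sketch of the control flow only: push_image, exec_debug and remove_image are hypothetical placeholders that just print what they would do, and none of them is a real talosctl subcommand or Talos API.

```shell
#!/bin/sh
# Hypothetical placeholders: each one only echoes the step it represents.
push_image()   { echo "push $1 to the node's containerd image store"; }
exec_debug()   { echo "exec into $1 (privileged)"; }
remove_image() { echo "remove $1 from the node"; }

debug_session() {
    image="$1"
    # Upload first, so the node never has to reach a registry itself.
    push_image "$image" || return 1
    exec_debug "$image"
    status=$?
    # Clean up even when the debug session itself failed.
    remove_image "$image"
    return "$status"
}

debug_session alpine:3.19
```

The clean-up step runs regardless of how the session exits, which matches the "once done, clean up everything" requirement.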

smira · 2024-05-08T16:04:37Z

To prevent any changes, probably mount host fs as read-only (?).

Add a kernel arg to completely disable the feature (?).

andrewrynhard · 2024-05-08T23:59:10Z

I think we should hold off on this for now. IMHO not having something like this is the point of Talos, really. I completely understand the urge to have a quick win, but something like this almost immediately breaks our whole stance with Talos Linux. Could we better understand the use case? If we are going to be an API Linux, let’s be an API Linux and figure out what we are missing for the use case. The dashboard could be a place to put more local debug tooling. This feature will absolutely be abused, and it makes us look weak in our stance.

@kvaps What is/are the scenario(s) in which you think this would be used?

kvaps · 2024-05-09T07:07:29Z

My story began with being unable to run standard debugging tools such as ping, arping, curl, as well as tcpdump and pwru, on a bare-metal server when the Kubernetes API was inaccessible.

Also, I have some scenarios for debugging specific CRI containers, for example, entering various Linux kernel namespaces and running these tools there. See the approach suggested by my kubectl node-shell plugin:

https://github.com/kvaps/kubectl-node-shell/?tab=readme-ov-file#x-mode

runningman84 · 2024-05-09T08:19:31Z

I have a debug DaemonSet running for the same reason. But if the kubelet is not running, you cannot do anything to fix it. In my case, things like ZFS might be misconfigured and need maintenance by an admin.

smira · 2024-05-09T10:35:35Z

I think there's nothing wrong with the APIs to run containers on Talos, as all Talos & Kubernetes do is run containers.

kubectl debug is effectively the same, but it requires the whole Kubernetes API stack to be running and healthy, and it is not properly scoped to the Talos API credentials, while talosctl debug is properly scoped to the API. So I don't think it breaks any of the Talos design goals.

andrewrynhard · 2024-05-09T11:09:19Z

> I think there's nothing wrong with the APIs to run containers on Talos, as all Talos & Kubernetes do is run containers.
>
> kubectl debug is effectively the same, but it requires the whole Kubernetes API stack to be running and healthy, and it is not properly scoped to the Talos API credentials, while talosctl debug is properly scoped to the API. So I don't think it breaks any of the Talos design goals.

It 100% breaks the design goals I had in mind when creating Talos. One goal was to do APIs and over time add the APIs we need to replace as much of the user space as we could. Adding a debug container is going to open the door to possible attacks, be abused, set a precedent for lazy practices, and make us less motivated to add APIs. Everything can just be dumped off into a debug container.

A debug container makes you ask if an API is even needed in the first place. The Talos “API” could literally just be apply and debug in that case. It tears apart our whole argument for having an API. We claim day in and day out in conversations with users and customers that there is a better future for infrastructure if we have APIs at the OS layer. A debug container doesn’t exactly portray confidence in that statement.

I want to be pragmatic here and I completely understand the use case but philosophically this isn’t Talos. We should rather be asking what APIs we can add, what operational knowledge can we build in, what information can we expose, and/or can we automatically resolve the issue.

There will be edge cases that become painful without a debug container, I completely understand that and I don’t want to tell anyone to just deal with it, but if we don’t have those pains we will never grow the Talos API and the automation goals it has, and worst of all it tears apart the whole argument for having an API in the first place.

andrewrynhard · 2024-05-09T11:20:53Z

One option today could be a system extension that runs a "debug container" with SSH enabled. Run it all the time if you really want these tools. The new ability to configure an extension could allow for adding allowed keys. This would address the corner cases, work today, and not break down the Talos Linux arguments we make day in and day out, as it (a debug container) wouldn't be something we support first class. It is essentially the same, but it is neither endorsed nor encouraged.

smira · 2024-05-09T11:29:53Z

Running containers is a basic feature of Talos, and I don't think adding this to the API breaks any promise or blocks the development of the APIs going forward. The proposal here is to add APIs to run containers and connect to their stdin/stdout/stderr (docker exec is an example of that). In general, it continues the trend of exposing containerd APIs via the Talos API, much like the Talos API already exposes some of the containerd APIs (listing containers, getting logs, listing images, etc.). So it is actually aligned with the Talos APIs, not going against them.

A container running is still sandboxed with some set of permissions of what the container can actually do.

I can understand the emotional reaction, but it's more about the way things are being used vs. having or not having some feature.

A "regular" Linux distro offers tools to do tons of things, but if systemd does something better, nobody is going to use those tools for it. In case of a disaster or bug, a tool might be the way to fix things. In the same spirit, if the Talos API provides something, nobody is going to use talosctl debug to do the same thing, but talosctl debug might fill the gap for an edge case that is rare or hard to wrap.

andrewrynhard · 2024-05-09T12:48:01Z

> Running containers is a basic feature of Talos, and I don't think adding this to the API breaks any promise or blocks the development of the APIs going forward. The proposal here is to add APIs to run containers and connect to their stdin/stdout/stderr (docker exec is an example of that). In general, it continues the trend of exposing containerd APIs via the Talos API, much like the Talos API already exposes some of the containerd APIs (listing containers, getting logs, listing images, etc.). So it is actually aligned with the Talos APIs, not going against them.
>
> A container running is still sandboxed with some set of permissions of what the container can actually do.
>
> I can understand the emotional reaction, but it's more about the way things are being used vs. having or not having some feature.
>
> A "regular" Linux distro offers tools to do tons of things, but if systemd does something better, nobody is going to use those tools for it. In case of a disaster or bug, a tool might be the way to fix things. In the same spirit, if the Talos API provides something, nobody is going to use talosctl debug to do the same thing, but talosctl debug might fill the gap for an edge case that is rare or hard to wrap.

Systemd offers a whole lot more than we do today, yet, here we are with very large companies using us and loving us and a community growing daily with zero marketing. The philosophy and stance of Talos Linux is just as important as the technical implementation. Talos Linux is a statement: we need to do infrastructure better and with APIs for everything.

I agree debug will fill a gap, but I am also 100% convinced it will be enabled by everyone as a thing you just do, just like we all run setenforce 0 when we install distros with SELinux. It is a slippery slope. People will want, and make a good argument for needing, more from debug over time, and eventually we end up with something extremely close to a shell, all the while not moving the needle and building out APIs. We will put time and energy into it when we could have put time and energy into building what is really needed here: collecting data and a way to act on that data. An API is perfect for that.

I would be more than happy to talk more and would invite a deeper discussion around this, but as of now I don't see this coming to Talos Linux. I don't think there is a right and wrong in this situation, so this subject makes it very easy to take a strong stance on either side and feel like the other is wrong. To be clear, I don't think this is wrong from a purely technical PoV, but there are bigger things at play here. In fact, I was excited about this idea when I first thought about it, but over the course of a day other things began to break down around it.

kvaps · 2024-05-09T16:34:57Z

@andrewrynhard, we already have the cat command; how about adding another one called socat? Kubernetes uses it to implement proxy and port-forward in their API.

We could do the same, which would enable us to debug CRI. For example:

talosctl socat TCP-LISTEN:8078,reuseaddr,fork UNIX-CLIENT:/run/containerd/containerd.sock
export CONTAINER_RUNTIME_ENDPOINT=tcp://127.0.0.1:8078

smira · 2024-05-09T16:44:11Z

> @andrewrynhard, we already have the cat command; how about adding another one called socat? Kubernetes uses it to implement proxy and port-forward in their API.

Not Andrew, but: I feel this might be powerful, but it is too much unconstrained access on which we can't impose any limits. API-level access has limits that we can enforce, while a raw socket is all or nothing. Plus, as @rothgar pointed out, it requires the user to have the crictl tools on their machine and the expertise to use them.

andrewrynhard · 2024-05-09T17:37:49Z

> > @andrewrynhard, we already have the cat command; how about adding another one called socat? Kubernetes uses it to implement proxy and port-forward in their API.
>
> Not Andrew, but: I feel this might be powerful, but it is too much unconstrained access on which we can't impose any limits. API-level access has limits that we can enforce, while a raw socket is all or nothing. Plus, as @rothgar pointed out, it requires the user to have the crictl tools on their machine and the expertise to use them.

I would agree. Debug containers are powerful too. And I see the need and desire for both of these ideas. I really do.

As Andrey points out, a reason we can do the level of automation we do and offer the level of security we do is because Talos imposes limits. I don't want those limits to be so restrictive that we start to lose adoption but I also want to stick to our principles.

This is a tough situation. What we have done in the past is waited for the right idea to come about and we were always happy we didn't rush to fix problem X if we didn't think any existing solutions would stay within the Talos ethos.

@kvaps Maybe we can start with what you needed specifically within the containerd API.

andrewrynhard · 2024-05-09T17:41:06Z

I just want to point out one of the main goals of Talos: keep humans off a machine and from breaking things. That is literally why the API exists. At one point I had the kernel running the kubelet as PID1 and it was impractical. I was faced with a decision. Drop in a shell and lose my goal of removing humans from the machine or find another way. That is when the API idea came about. We should strive to push Talos towards this goal IMHO.

kvaps · 2024-05-14T06:41:14Z

I totally understand, but in this case we have to cover everything with the API.

In particular, I need commands like ping, arping, tcpdump, and curl, and in some cases pwru and nsenter, as well as tools from the iproute2 package. What is supposed to be the correct way to run them if the Kubernetes API is no longer accessible?

thomasdba · 2024-06-20T02:39:37Z

Debug is a special case, or a special mode; it is an option to give users. Everything looks like a nail to someone with a hammer. If the hammer could handle everything, that would be fine, but in truth it cannot. For special cases, I think it is okay to use special, smart tools.

rothgar · 2024-10-02T21:58:36Z

For now we're not going to implement an easy way to do this via talosctl. There are still some other ideas we're thinking about that could provide similar debug access but nothing is planned right now.

I'm going to close this issue because we'll need to think about other ways we can implement this without giving raw/open access that bypasses the API.

smira mentioned this issue May 8, 2024

Talos 1.8 Release Checklist ✔️ #8484

Closed

smira mentioned this issue Aug 30, 2024

Talos 1.9 Release Checklist ✅ #9249

Open

smira mentioned this issue Sep 18, 2024

FR: Ability to Open a Debug Container via Talos Dashboard for Out-of-Band Access (Optional via System Extensions) #9332

Open

rothgar closed this as completed Oct 2, 2024
