pursuing conventional systemd+podman interaction #6400
Comments
If I didn't hit it clearly, I did try to adopt 1.9.2. It requires a couple things (but ultimately does not work well). #6084 has some more information as well.
Starting this up, you can only see console output from the container by doing The |
you can get 1.8.2-2 from https://koji.fedoraproject.org/koji/buildinfo?buildID=1479547 I'll save it to my fedorapeople page as well and send you the URL later. |
if you enable linger mode and there is already the user session running, is there any disadvantage in installing the .service file into |
@giuseppe for you to be able to do that you'd need a shell for that "system" account. Above I'm creating the user as root with a Also, this suggestion doesn't address what I'm primarily asking for above: desiring console output from the running container to be seen by systemd/journald. Without being able to see the combination on:
You have an extremely hard time figuring out what is going on with the system (you have to look in multiple places to piece together the state of errors). |
@vrothberg The core ask here (viewing logs for systemd-managed Podman) seems to be a pretty valid one - our current forking= approach does break this, and I was thinking that it ought to be possible for the |
There was a very similar request by @lucab :coreos/fedora-coreos-docs#75 (comment) I was also thinking about the log driver 👍 |
@ashley-cui Could you look into the --log-driver changes? |
@storrgie I've been pursuing similar things recently. Do -d, and keep the forking. That alone should take care of all container logs showing up in journald, you just need to do As conmon will be providing those keys - CONTAINER_ID and CONTAINER_NAME. I've been doing lots of testing, basically what I've been doing is - start a container, generate the output to journald, then use You could use CONTAINER_TAG too... i.e. add --log-opt tag=WhateverYouWant and find it with If you want it to show under the unit, like I do, I do this: Note, my container is root, not rootless, and the host is running Flatcar. My guess is you can get similar results by possibly tweaking the cgroup-parent. By putting the processes under the cgroup, systemd finds that they're associated with a unit - but I'd expect conmon being in the correct cgroup SHOULD be all you need. The added benefit of running all the processes in the systemd service's cgroup is that bind mounted /dev/log ALSO associates to the unit file, automagically. You don't get the automagic CONTAINER_NAME from conmon journald records, but you DO get anything you put in the service file as a LogExtraField - so you could use that to find your logs as well. |
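The journald lookups described in the comment above might look roughly like the following. This is only a sketch: the commands that were originally posted are not preserved here, so the container name `web`, the tag `mytag` and the image are hypothetical, and the script prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not run) the kind of commands described above.
# "web", "mytag" and the image name are hypothetical placeholders.

# Detached container that logs to journald with a custom tag.
run_cmd() {
  printf 'podman run -d --log-driver=journald --log-opt tag=%s --name %s %s\n' "$2" "$1" "$3"
}

# Match on the CONTAINER_NAME field that conmon attaches to each record.
journal_by_name() {
  printf 'journalctl CONTAINER_NAME=%s\n' "$1"
}

# Match on CONTAINER_TAG, which is set via --log-opt tag=<value>.
journal_by_tag() {
  printf 'journalctl CONTAINER_TAG=%s\n' "$1"
}

run_cmd web mytag docker.io/library/nginx
journal_by_name web
journal_by_tag mytag
```

Matching on a journal field this way is independent of the unit name, which is why it works even when the records are not attributed to the service.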
I'm running rootless containers on Fedora Server. I'm able to see logs using |
I really do not recommend running |
(There is also |
I see traffic on the mailing list from @rhatdan about an FAQ... I'm feeling more and more as I learn about this project that the idea this can "replace" docker is basically gimmicky at this stage. There is no clear golden pathway for running containers as daemons on systems with podman+systemd. It seems fraught with edge cases. I'd really love to see this ticket be taken seriously as I think there are a LOT of people trying to depart docker land and systemd+podman is a way to rid yourself of the docker monolithic daemon. |
I think we definitely need a single page containing everything we recommend about running containers inside units (best practices, and the reasons for them). I've probably explained why we made the choice for forking vs simple five times at this point; having a single page with a definitive answer on that would be greatly helpful to everyone. We'll need to hash some things out as part of this, especially the use of rootless Podman + root systemd as this issue asks, but even getting the basics written down would be a start. |
@mheon that would indeed help, but I'm not sure that's going to solve much. For example, from the thread at coreos/fedora-coreos-docs#75, that content currently exists in the form of a blog post which unfortunately is:
I think it would be better to first devise a podman mode which works well when integrated in the systemd ecosystem, and only then document it. As a sidenote, many containerized services (eg. etcd, haproxy, etc.) do use sd-notify in order to signal when they are actually initialized and ready to start serving requests. For that kind of autoscale-friendly logic to work, a |
I believe the reason we can't auto-generate Type=notify is because things
are not good if the app in the container does not support it (Podman can
hang) but it should work if you set it (though I'm actually not sure if it
respects our PID files - if it acts like Type=simple in that respect it
will never be really safe to use.)
On the rest, I think the most important thing is getting logging via
Journald working properly. Some things like KillMode I do not expect to be
resolved, and I honestly don't view it as a problem - our design here is
different than typical services by necessity (running without a daemon
forced this), so we don't quite fit into the usual pattern Systemd expects.
Podman will still guarantee that things are cleaned up on stop, as we would
if we are not managed by Systemd.
…On Sat, Jun 13, 2020, 04:29 Luca Bruno ***@***.***> wrote:
> @mheon that would indeed help, but I'm not sure that's going to solve
> much. For example, from the thread at coreos/fedora-coreos-docs#75, that
> content currently exists in the form of a blog post
> <https://www.redhat.com/sysadmin/podman-shareable-systemd-services>
> which unfortunately is:
>
> - already stale at this point (podman-generate no longer generates that unit)
> - not really integrating well with systemd service handling (e.g. journald, sd-notify, user setting, etc.)
> - somehow concerning/fragile (e.g. KillMode)
>
> I think it would be better to first devise a podman mode which works well
> when integrated in the systemd ecosystem, and only then document it.
>
> As a sidenote, many containerized services (e.g. etcd, haproxy, etc.) do
> use sd-notify in order to signal when they are actually initialized and
> ready to start serving requests. For that kind of autoscale-friendly logic
> to work, a Type=notify service unit would be required.
|
On the user setting specifically - I still believe that is an issue with
Systemd. We're in contact with the Systemd team to try and find a solution.
|
We got the User setting working, it was mainly a problem with -d, no? Unless there's something else outstanding, I think that's solved. Similarly, the journald log-driver works well for me... unless you try to log a tty, which would be a bad idea anyway, now that exec is fixed. Systemd integration isn't great with docker either - docker's log-driver is exactly analogous to what conmon does, docker's containers are launched by the daemon which puts them in another cgroup, unless you use cgroup-parent tricks, and sometimes getting the container to work right w.r.t. logging and groups requires hacks like systemd-docker which throws a hacky shim around sd-notify. So are we really saying podman+systemd is somehow worse? Or just not better? Because it seems better to me. Doesn't seem like Docker has a golden pathway either. I've run docker w/ cgroup-parent sharing the unit's cgroup and systemd-docker (even though it's unsupported) for over a year, and haven't had any problems with systemd and docker fighting. I'm not sure why podman would... but I defer to the experts. The only thing I have with docker now that I don't have with podman is bind mounting /dev/log works - because I put the docker container in the same cgroup as the unit. Without that, I'd need some sort of syslog proxy, which would probably have to live in conmon, and is a whole other discussion and probably only relevant to me. |
That's not accurate. We just updated the blog post last week and do that regularly. The units are still generated the same way. Once Podman v2 is out, we need to create some upstream docs as a living document and point the blog post there.
We only support
We've been discussing that already in depth. We want Podman to handle shutdown (and killing) and prevent signal races with systemd which does not know the order in which all processes should be killed.
|
I agree and made a similar conclusion last week when working with support on some issues. Once v2 is out (and all fixes are in), I'd love us to create a living upstream document that the blog post can link to. |
I opened #6604 to break out the logging discussion. |
@vrothberg thanks! I shouldn't have piled up more topics in here, sorry for that. |
No worries at all, @lucab! All input and feedback is much appreciated.
That would be great, sure. While we support sd-notify, we don't generate these types. Having a dedicated issue will help us agree on what such a unit should look like and eventually get that into upstream docs (and man pages). Thanks a lot! |
Since we're having this discussion, and there's plenty of talk about KillMode, and cgroups, and where things should reside - it makes sense to me that podman's integration with systemd already has a blueprint - that being systemd-nspawn. The systemd-nspawn@.service unit includes things like: KillMode=mixed This means (among other things) you end up with And systemd has no problem monitoring the supervisor Pid, I'm guessing because Delegate is set, and it's a sub-cgroup. nspawn has options like --slice, --property, --register, and --keep-unit - probably all of which should be implemented similarly in podman... and the caveats are already spelled out in the documentation. https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html nspawn also has options for the journal - how it's bind mounted and supported, plus setting the machine ID properly for those logs... etc. I'd imagine we'd want nspawn to be the template? |
And doing Delegate and sub-cgroups like that also means systemctl status knows the Main PID is the supervisor, but shows the full process tree including the payload clearly in the status output, and the service type is sd-notify, so I imagine it's talking back to systemd to let it know these things. |
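For reference, the three service settings named in the two comments above (notify type, mixed kill mode, cgroup delegation) could be collected in a unit fragment like the one below. This is a sketch assembled from the directives mentioned in the discussion, not the unit text that was originally posted and not something Podman generates; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: print a [Service] fragment with the directives named above.
# nspawn_style_fragment is a hypothetical helper name.
nspawn_style_fragment() {
  cat <<'EOF'
[Service]
# systemd waits for a readiness notification from the service
Type=notify
# SIGTERM goes to the main process; remaining processes get SIGKILL afterwards
KillMode=mixed
# the service is allowed to manage its own sub-cgroups
Delegate=yes
EOF
}

nspawn_style_fragment
```

Whether Podman and conmon behave well under these settings is exactly what the thread is debating, so treat this as a description of the nspawn model rather than a recommendation.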
For that matter I've wondered if it's possible to use/wrap/hack/mangle something into place to allow systemd-nspawn itself to be the OCI container runtime, instead of crun or runc. Moreso a thought experiment than anything else, but the key hangup seems to be nspawn wants a specific mount to use, which podman can provide since it already did all the work to create the appropriate overlay bindmount. Probably involves reading config.json and turning it into command line arguments? I'm unclear separation-wise which parts of the above fit into which parts of the execution lifecycle. |
There was talk about making On the Delegate change - I'd have to think more about what this means for containers which forward host cgroups into the container (we'll need a way to guarantee that the entire unit cgroup isn't forwarded). I also think we'll need to ensure that the container remembers it was started with cgroupfs, so that other Podman commands launched from outside the unit file that require cgroups (e.g. |
to simulate what nspawn does we'd need to tell the OCI runtime to use the cgroup already created by conmon instead of creating a new one. Next I think we can go a step further and get closer to what nspawn does by having a single cgroup for conmon+container payload |
@jdoss If you're using selinux, I suggest you compile and place the crun binary in /usr/local/bin, as that folder is recognized in the policy. If you're going to have a local podman or runc or crun it should be there, and chcon'd to match, i.e. In /etc/containers/containers.conf:
Or specify it on the command line as @giuseppe indicated. |
@goochjj and @giuseppe I just compiled crun from master and put it in
|
Add --pids-limit 0 to your run args |
Wait you're cgroups v2 now? I don't have that problem under cgroups v2 rootless. What does |
|
I can't get the infra container to start, because you're binding to ports 80 and 443 as non-root... |
setting /proc/sys/net/ipv4/ip_unprivileged_port_start |
Hmm and there it is If I remove your --pod it works |
To allow the pod to bind to those ports. |
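A small check along these lines can show whether a rootless pod would be allowed to bind ports 80 and 443. It is a sketch under the assumption that the kernel default threshold is 1024 when the file cannot be read; `can_bind_unprivileged` is a hypothetical helper, and the script only reads the setting, it does not change it.

```shell
#!/usr/bin/env bash
# Succeeds if an unprivileged process may bind port $1 when the
# kernel threshold (ip_unprivileged_port_start) is $2.
can_bind_unprivileged() {
  [ "$1" -ge "$2" ]
}

threshold_file=/proc/sys/net/ipv4/ip_unprivileged_port_start
# Fall back to the kernel default of 1024 if the file is unreadable.
threshold=$(cat "$threshold_file" 2>/dev/null || echo 1024)

for port in 80 443; do
  if can_bind_unprivileged "$port" "$threshold"; then
    echo "port $port: bindable without root"
  else
    echo "port $port: blocked; as root, lower net.ipv4.ip_unprivileged_port_start to $port or below"
  fi
done
```

Lowering the threshold applies to every unprivileged process on the host, not only to Podman, which is worth weighing before changing it.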
I think it's because you're using a pod. When I run this as the user, rootless, I get this: Pod creates: Container (without split) creates: Through Systemd as the user, I get this: Container (without split) creates: |
TLDR, @giuseppe would have to modify/extend another PR to handle pods. It looks like when a container is spawned in a pod, it assumes its parent slice will be the parent cgroup path. (Which is reasonable) Since pod create doesn't have a --cgroups split option, the pod's conmon is attached to the service cgroup, and the pod's slice is in the user slice, divorced from the service's cgroup. You can't simultaneously have a service (i.e. elasticsearch) be part of the unit's service, and also the pod's slice. Nor can you have a second systemd unit muck around with the pod's cgroup - that's probably a bad idea. What's your desired outcome here, @jdoss? /system.slice/mycool-pod.service/supervisor -> pod conmon Then ALL the pod services aren't contained in a slice. Right now it's Is this insufficient in some way? |
Or maybe we should do this in a more systemd-like way? i.e. Slice=machines-mycool_pod.slice Pod Then everything is properly in a parent slice - is this what we'd want If so, the --cgroups split would have to be set at the pod create level, and child services would have to know if split is passed, to not inherit the cgroup-parent of the pod. |
@giuseppe I don't know what's causing this - but there are times when I need to set --pids-limit 0. It seems like there's a default of pids-limit 2048 coming from somewhere, not the config file and not the command line, and then when crun sees it can't do cgroups with pids-limit, it throws the runtime error. If you happen to get the cgroup right - i.e. it's something crun can modify and it has a pids controller, then the error isn't present. |
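Until the default is fixed, the workaround mentioned in this thread (passing `--pids-limit 0`) could be applied only where it seems to be needed. The sketch below assumes the problem is limited to cgroups v1, as suggested later in the thread, and detects v2 by the presence of `cgroup.controllers` at the cgroup mount root. Both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Report "v2" when the unified hierarchy is mounted at $1, else "v1".
# cgroup.controllers only exists at the root of a cgroups v2 mount.
cgroup_version() {
  if [ -f "$1/cgroup.controllers" ]; then echo v2; else echo v1; fi
}

# Print extra podman run arguments for the workaround discussed above:
# on cgroups v1, pass --pids-limit 0 so that no pids limit is requested.
extra_run_args() {
  if [ "$(cgroup_version "$1")" = v1 ]; then
    echo "--pids-limit 0"
  fi
}

echo "cgroups $(cgroup_version /sys/fs/cgroup): extra run args: '$(extra_run_args /sys/fs/cgroup)'"
```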
@goochjj I am trying to set things up so I can have many pods running under a rootless user/users via systemd units with the Since FCOS doesn't support user systemd units via Ignition, I have to set them up as system units, which is fine since I prefer system units over user units anyway, to prevent them from being modified by non-root users. |
Right, but all this works for you without --cgroups split, correct? Is there something you're hoping to gain with --cgroups split? |
The pids-limit is probably Podman automatically trying to set the maximum available for that rlimit - we should code that to only happen if cgroups are present. |
@goochjj I was running FCOS with cgroups v1 up until I saw this thread that introduced I am not trying to gain anything specific by using |
@mheon I'm unclear on why cgroups aren't present... let alone that default. It's really annoying, and seems to be cgroups v1 specific. Should I create this as a separate issue? |
I believe that's a requirement forced on us by cgroups v1 not being safe for rootless use, unless I'm greatly misunderstanding? |
@mheon I'm fine with that, as long as it doesn't explicitly require me to --pids-limit 0 everything, which it's currently doing. This code
in pkg/spec/spec.go seems to indicate it should already be ignoring the default on cgroups v1. I'm digging. |
Cuz this isn't great.
|
This is definitely a bug. Is this 2.0? |
2.1.0-dev. Actually, master, plus my sdnotify So, sounds like I should create a new issue. |
A friendly reminder that this issue had no activity for 30 days. |
Fixed in master. |
This is an RFE after talking with @mheon for a bit in IRC (thanks for that, sorry I kept you so late). In the shortest form I can think of the enhancement would be: facilitate podman/conmon interacting with systemd in a way that provides console output for systemctl and journalctl. In bullet form:
- /sbin/nologin
- /etc/systemd/system/<unit>.service) that specifies that "system" user in User=.
- systemctl start <unit>.service and be able to see the console output of the container
- journalctl -u <unit>.service and be able to see the historical console output of the container

My use case is that I want to use podman to run images that are essentially "system" services, but as "user" because I want the rootless isolation. I've been consuming podman for a bit now (starting with 1.8.2) and am likely stuck on that version because in newer versions my approach breaks: I lose all logging from the container. I have tried --log-driver=journald but have no idea how to find a hand-hold for the console output (what -u should I be looking for, because it's not .service, and it's not the container... and it's not podman-.scope). Basically podman doesn't provide the init system with a console hand-hold so I'm rolling blind.

Here is an example of mattermost, under 1.8.2 this works how I'd like it to work (e.g. I'm getting console output). I'm doing some things that are different than what podman generate systemd offers, but it's because my explicit goal is to:
- sudo -u <user> -h <home> podman logs <container-name>)

With these units above I am able to:
- Type=simple and lack of -d)
- systemctl <unit> and journalctl -u <unit>
- ExecPre and ExecStop)
- podman-mattermost.service requires the podman-mattermost-postgres.service (Requires=)
- podman-mattermost-postgres.service will get a stop signal if I stop podman-mattermost.service (PartOf=)
- podman-mattermost.service closes out the networking namespace before podman-mattermost-postgres.service can finish up (I think), so it's not ideal... I'd be interested in suggestions.

Tagging @lsm5 as well since I think for my use case I'm relegated to use 1.8.2 in F32 for the time being... so I am wondering if that is going away any time soon?
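The unit files from the original report are not preserved here. As a rough illustration of the approach it describes (a system unit with User=, Type=simple, podman kept in the foreground without -d, and a Requires= dependency), a sketch might look like the following. The user name, container name, image and the `After=` ordering are assumptions made for the example, not the reporter's actual configuration, and maintainers in this thread prefer the forking approach instead.

```shell
#!/usr/bin/env bash
# Sketch only: print a system unit that runs a container in the foreground
# as a dedicated user. All names passed in below are hypothetical.
print_unit() {
  local user=$1 name=$2 image=$3 dep=$4
  cat <<EOF
[Unit]
Description=Podman container ${name}
# Pull in the dependency and order this unit after it
Requires=${dep}
After=${dep}

[Service]
User=${user}
# Type=simple with no -d: podman stays in the foreground, so its
# stdout/stderr are captured for this unit
Type=simple
ExecStart=/usr/bin/podman run --rm --name ${name} ${image}
ExecStop=/usr/bin/podman stop ${name}

[Install]
WantedBy=multi-user.target
EOF
}

print_unit mattermost mattermost example.invalid/mattermost:latest podman-mattermost-postgres.service
```

With a unit of this shape, `journalctl -u <unit>.service` is where the reporter expected the container's console output to appear; whether it does depends on the Podman version, which is the subject of this issue.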