container-remove event does not set exit code #19124

martinpitt · 2023-07-05T07:05:33Z

Feature request description

cockpit-podman is plagued with lots of race conditions and flaky tests. I have investigated many of them, but the remaining ones are due to a fundamental issue with the monitoring API.

The UI uses the libpod/events API, which notifies about high-level actions such as start or died, for example:

{"status":"start","id":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","from":"localhost/test-busybox:latest","Type":"container","Action":"start","Actor":{"ID":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","Attributes":{"containerExitCode":"0","image":"localhost/test-busybox:latest","name":"swamped-crate","podId":""}},"scope":"local","time":1688533611,"timeNano":1688533611933738591}

However, this does not contain any (or at least most) of the properties that the UI needs to show, so in reaction to these, the UI does a containers/json query for that container:

{"method":"GET","path":"/v1.12/libpod/containers/json","body":"","params":{"all":true,"filters":"{\"id\":[\"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2\"]}"}}

which then responds with all the info that the UI needs:

[{"AutoRemove":false,"Command":["sh","-c","echo 123; sleep infinity"],"Created":"2023-07-05T05:06:41.355776664Z","CreatedAt":"","Exited":false,"ExitedAt":1688533611,"ExitCode":0,"Id":"39c0313e0e35c49f56fa3b8a0c228cc6a58455846d5271c05365fa0df56876a2","Image":"localhost/test-busybox:latest","ImageID":"24ac8b76cfb0440579ade1908a8a765d3c8a62bd366058cf84e1a7d6754ee585","IsInfra":false,"Labels":null,"Mounts":[],"Names":["swamped-crate"],"Namespaces":{},"Networks":[],"Pid":19176,"Pod":"","PodName":"","Ports":null,"Size":null,"StartedAt":1688533611,"State":"running","Status":""}]

The problem is that this is racy: The /containers/json call is necessarily async, and when events come in bursts, they will then overlap. But their replies from podman are not coming in in the same order. This is a log capture from a part of the test where it does a few container operations like stopping and restarting a container. I stripped out all the JSON data for clarity, the important bit is the ordering:

> debug: podman user call 44:
> debug: podman user call 45:
> debug: podman user call 45 result:
> debug: podman user call 46:
> debug: podman user call 44 result:
> debug: podman user call 47:
> debug: podman user call 46 result:
> debug: podman user call 47 result:
> debug: podman user call 48:
> debug: podman user call 43 result:
> debug: podman user call 49:
> debug: podman user call 49 result:
> debug: podman user call 50:
> debug: podman user call 48 result:
> debug: podman user call 51:
> debug: podman user call 50 result
> debug: podman user call 52:
> debug: podman user call 52 result:
> debug: podman user call 51 result:

So if the container moves from "Running" → "Exited" → "Stopped" → "Restarting" → "Running", a jumbled response order can lead to swaps, and the final state reported in the UI is e.g. "Restarting" or "Exited". The latter happened in this run, where the screenshot says "Exited", but podman ps says "Up" (i.e. "Running"), as can be seen in the "----- user containers -----" dump in the log.

Suggest potential solution

My prefered solution would be to avoid having to call /containers/json after a "start" or "rename" event in the first place. That only leads to additional API traffic and thus more computational overhead on both the podman and the UI side, and is prone to these kinds of race conditions. D-Bus services like systemd or udisks generally solved this with the PropertiesChanged signal, i.e. there is a notification with the set of changed properties each time when there is a change. These are naturally ordered correctly, and the watcher can tally them up to always have an accurate model of the state without having to do extra "get" calls.

For the podman API, this cannot just be squeezed into the existing start (or remove, etc.) events, as the container properties can change more often, and also independently from the coarse-grained lifecycle events.

Perhaps this could introduce a new event type changed that gets fired whenever any property changes, and deliver the /containers/json info for the container(s) which changed. Both "your" (podman) and "my" (cockpit-podman) sides already have the code to generate/parse this information, it would just mean some minor plumbing changes.

If this is expensive, it may also be adequate to explicitly opt into getting these notifications, although connecting to /events generally already means that the listener wants to know this kind of information.

Have you considered any alternatives?

It may also be possible to change podman to not reply to requests out of order. I don't know how easy/hard that is with Go and coroutines. I know that it is very hard in JavaScript on the client side to reorder the replies.

It might be easier on our side to completely serialize all API calls, but that would make the UI very slow especially if there are many containers. These are independent from each other, so serializing calls is not conceptually necessary.

Additional context

No response

The text was updated successfully, but these errors were encountered:

martinpitt · 2023-07-05T07:06:29Z

@jelly FYI

Don't start pod and container state updates at the same time. They do asynchronous state updates which sometimes step on each other's toes and reverse the real event order. This mitigates the problem with getting podman API call results in the wrong order (containers/podman#19124); that causes containers to sometimes appear in a non-current state.

martinpitt · 2023-07-05T08:35:29Z

I just talked to @vrothberg , many thanks! Notes:

podman has a mode for logging more information in events, but it's opt-in (containers.conf), so nothing that c-podman can rely on.

But it should be possible to greatly improve this with the existing API:

Only do a /json inspect for the create event, to get image, mounts, pod etc. -- this information won't change
predict the future state for events like 'died', 'start'; the start time is contained in the 'start' event
ignore all transient state events like 'restart', 'mount', they will always be follow by an event for a persistent state (like 'died' or 'start')
also ignore unrelated states like 'prune'

With that, there should be by and large only one /json inspect instead of umpteen, which should avoid the problem to a degree where it's not relevant any more.

I'll try that in the next days (well, some PTO ahead), and report back here. In the best case, I'll just close this.

Thanks!

vrothberg · 2023-07-05T08:38:25Z

@martinpitt and I had a quick call on discussing the issue. We concluded that there is a way to significantly reduce the amount of API calls emitted by Cockpit.

At the time of writing, Cockpit inspects a container on (too) many container events. As mentioned in the issue description above, the inspections are done to get the current state of the container. We can reduce the inspects to only 1 by doing the following:

On a create event we know that a new container is there. To get all the metadata, we need to inspect it. That is the 1 call we need.
For all other events, no inspect is needed.
To display state (transitions) in the UI, Cockpit can interpret incoming events.
- create -> Created
- init -> Initialized
- start -> Running
- died -> Exited
- remove -> Removed (stop displaying the container in the UI)

When starting Cockpit, then all containers must be listed/inspected to get all the metadata and state.

Thanks for your summary, @martinpitt :) I think we have a shared understanding.

These are internal transient states which don't need to reflect in the UI. They happen quickly in bursts, with a "permanent state" event following such as "create", "died", or "remove". This helps to reduce the API calls and thus mitigates out-of-order results; see containers/podman#19124

Luap99 · 2023-07-05T14:36:17Z

@martinpitt and I had a quick call on discussing the issue. We concluded that there is a way to significantly reduce the amount of API calls emitted by Cockpit.

At the time of writing, Cockpit inspects a container on (too) many container events. As mentioned in the issue description above, the inspections are done to get the current state of the container. We can reduce the inspects to only 1 by doing the following:
* On a `create` event we know that a new container is there.  To get all the metadata, we need to inspect it.  That is the 1 call we need.

* For all other events, no inspect is needed.

* To display state (transitions) in the UI, Cockpit can interpret incoming events.
  
  * `create` -> Created
  * `init` -> Initialized
  * `start` -> Running
  * `died` -> Exited
  * `remove` -> Removed (stop displaying the container in the UI)
When starting Cockpit, then all containers must be listed/inspected to get all the metadata and state.

Thanks for your summary, @martinpitt :) I think we have a shared understanding.

Can we close the issue, sounds like we do not need podman changes?

vrothberg · 2023-07-05T14:39:03Z

I leave it open to @martinpitt. I expect the issue to be resolved with the proposed idea but I am OK to leave it open until we're sure.

martinpitt · 2023-07-05T16:29:47Z

I'm inching closer in cockpit-project/cockpit-podman#1324 , so far it's looking good. I should finish this by tomorrow, and then most probably close this, unless something missing turns up.

These events change the state in a predictable way. Avoid the expensive updateContainer() call for these, to avoid multiple calls overlapping each other. See containers/podman#19124

martinpitt · 2023-07-06T02:26:16Z

* On a `create` event we know that a new container is there.  To get all the metadata, we need to inspect it.  That is the 1 call we need.
* For all other events, no inspect is needed.

Agreed, that would be nice. However, that also means that any kind of runtime related property needs to be part of the event info then.

  * `start` -> Running

This is missing the Pid and StartedAt properties. We can make-do without Pid, and we can probably estimate StartedAt with the event's time property. So that still seems okay. (I haven't yet put this into practice yet, still working on rearranging the event handling -- that requires some serialization with the /containers/json request).

  * `died` -> Exited

This is missing the ExitCode and Error properties, which cannot be inferred. FinishedAt etc. can again be estimated by the event's time.

BTW, as @vrothberg was mentioning log flooding: Every single event repeats the Attributes map -- these are properties of the image, and will never change. The only exception is the podId, which feels like it really shouldn't be in that image attributes list, it's a runtime property, not an image one -- and there's also the possibility of namespace collision. This is a separate issue, I filed it as #19143.

vrothberg · 2023-07-06T07:23:12Z

This is missing the Pid and StartedAt properties. We can make-do without Pid, and we can probably estimate StartedAt with the event's time property. So that still seems okay.

Yes, that will work.

This is missing the ExitCode and Error properties, which cannot be inferred.

Interesting! Should we make the exit code part of the attributes? @Luap99 WDYT?

With a burst of events these get called in parallel. But podman does not return them in the call order [1], which led to non-current state updates. [1] containers/podman#19124

martinpitt · 2023-07-06T07:36:51Z

Should we make the exit code part of the attributes

Just to avoid misunderstanding -- please not literally .Attributes (see #19143); but to the attributes of the event JSON message, that'd be nice. Thanks!

With a burst of events these get called in parallel. But podman does not return them in the call order [1], which led to non-current state updates. [1] containers/podman#19124

Luap99 · 2023-07-06T08:25:14Z

Adding the exit code seems useful in general but why are you interested in the PID? That should not add any useful value, by the time you read the PID it may be already reused by a different process (well theoretically at least).

Also what is Error even and why should this be only added to died?

martinpitt · 2023-07-06T08:40:08Z

why are you interested in the PID?

Not particularly -- it's just something that we happen to use in our tests to have a proper indicator when a restart command actually finished, but it's something that we can easily replace (with the time stamp).

so what is Error even

I can't say, I'm afraid. It's part of the /containers/json properties. If it can happen for other events, it should surely be added there as well.

martinpitt · 2023-07-06T08:57:59Z

Retitling / narrowing the scope of this to ExitCode, it seems everything else discussed here has a solution or workaround. Thank you!

These are internal transient states which don't need to reflect in the UI. They happen quickly in bursts, with a "permanent state" event following such as "create", "died", or "remove". This helps to reduce the API calls and thus mitigates out-of-order results; see containers/podman#19124

These are internal transient states which don't need to reflect in the UI. They happen quickly in bursts, with a "permanent state" event following such as "create", "died", or "remove". This helps to reduce the API calls and thus mitigates out-of-order results; see containers/podman#19124 Also fix the alphabetical sorting of the remaining events.

martinpitt · 2023-07-13T08:11:07Z

Nevermind, the exit code is part of the Attributes: already, at least for the died event:

{"status":"died","id":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","from":"docker.io/library/debian:sid","Type":"container","Action":"died","Actor":{"ID":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","Attributes":{"containerExitCode":"3","image":"docker.io/library/debian:sid","name":"admiring_bhabha","podId":""}},"scope":"local","time":1689235736,"timeNano":1689235736846535953}

It doesn't stick around, though -- the next event is usually "remove", and then it gets reset to 0:

{"status":"remove","id":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","from":"docker.io/library/debian:sid","Type":"container","Action":"remove","Actor":{"ID":"319da37bf7e6ee1b078b7d4e348710a67083b10b7ed7cf7e61e166bda92ab074","Attributes":{"containerExitCode":"0","image":"docker.io/library/debian:sid","name":"admiring_bhabha","podId":""}},"scope":"local","time":1689235736,"timeNano":1689235736905768245}

But good enough, I suppose.

vrothberg · 2023-08-17T08:19:15Z

I guess so

github-actions · 2023-09-19T00:06:48Z

A friendly reminder that this issue had no activity for 30 days.

…om int to int ptr Added additional check for event type to be remove and set the correct exitcode. While it was getting difficult to maintain the omitempty notation for Event->ContainerExitCode, changing the type from int to int ptr gives us the ability to check for ContainerExitCode to be not nil and continue operations from there. closes containers#19124 Signed-off-by: Chetan Giradkar <[email protected]>

martinpitt added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 5, 2023

martinpitt mentioned this issue Jul 5, 2023

Debug flakes cockpit-project/cockpit-podman#1324

Closed

martinpitt mentioned this issue Jul 5, 2023

Serialize start and remove event handling cockpit-project/cockpit-podman#1334

Merged

martinpitt mentioned this issue Jul 6, 2023

container event Attributes mixes unrelated namespaces and scopes #19143

Open

martinpitt changed the title ~~Race-free monitoring of containers property changes~~ Add "ExitCode" property to events Jul 6, 2023

martinpitt closed this as completed Jul 13, 2023

martinpitt changed the title ~~Add "ExitCode" property to events~~ Race-free monitoring of containers property changes Jul 13, 2023

rhatdan removed the stale-issue label Aug 16, 2023

github-actions bot added the stale-issue label Sep 19, 2023

cgiradkar mentioned this issue Nov 6, 2023

Set correct exitcode in remove events #20609

Merged

openshift-merge-bot bot closed this as completed in #20609 Nov 28, 2023

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Feb 27, 2024

github-actions bot locked as resolved and limited conversation to collaborators Feb 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

container-remove event does not set exit code #19124

container-remove event does not set exit code #19124

martinpitt commented Jul 5, 2023

martinpitt commented Jul 5, 2023

martinpitt commented Jul 5, 2023

vrothberg commented Jul 5, 2023

Luap99 commented Jul 5, 2023

vrothberg commented Jul 5, 2023

martinpitt commented Jul 5, 2023

martinpitt commented Jul 6, 2023

vrothberg commented Jul 6, 2023

martinpitt commented Jul 6, 2023

Luap99 commented Jul 6, 2023

martinpitt commented Jul 6, 2023

martinpitt commented Jul 6, 2023

martinpitt commented Jul 13, 2023

vrothberg commented Aug 17, 2023

github-actions bot commented Sep 19, 2023

container-remove event does not set exit code #19124

container-remove event does not set exit code #19124

Comments

martinpitt commented Jul 5, 2023

Feature request description

Suggest potential solution

Have you considered any alternatives?

Additional context

martinpitt commented Jul 5, 2023

martinpitt commented Jul 5, 2023

vrothberg commented Jul 5, 2023

Luap99 commented Jul 5, 2023

vrothberg commented Jul 5, 2023

martinpitt commented Jul 5, 2023

martinpitt commented Jul 6, 2023

vrothberg commented Jul 6, 2023

martinpitt commented Jul 6, 2023

Luap99 commented Jul 6, 2023

martinpitt commented Jul 6, 2023

martinpitt commented Jul 6, 2023

martinpitt commented Jul 13, 2023

vrothberg commented Aug 17, 2023

github-actions bot commented Sep 19, 2023