
Resource-leaked logs-collector-XXXX containers #1833

Closed
mieubrisse opened this issue Nov 19, 2023 · 1 comment · Fixed by #1870
Labels: bug, cli, critical

Comments

@mieubrisse (Collaborator) commented Nov 19, 2023

What's your CLI version?

0.85.29

Description & steps to reproduce

My local Docker-backed Kurtosis is empty:

$ kurtosis enclave ls
UUID   Name   Status   Creation Time

kurtosis clean -a doesn't clean anything, as expected.

However, there are two zombie logs-collector containers hanging around that don't go away:

[Screenshot: the two leaked logs-collector containers]

These do not correspond to any enclave networks, as there are none (as expected):

$ docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
8662e507c238   bridge    bridge    local
deab0a0cb95e   host      host      local
a8c59496b516   none      null      local

More details in #1832

Desired behavior

Kurtosis should always fully clean up after itself.

The original clean logic was written like so:

  1. Use the presence/absence of -a to decide which enclaves ought to remain (if -a is specified, no resources should remain).
  2. After finding the enclaves-to-remain, find and kill all Kurtosis resources that aren't in that set, in the proper order (containers -> volumes -> networks). We deliberately DON'T say "look for enclave networks, and then remove the resources associated with them", because if an enclave network is removed without all of its resources being removed (e.g. because a failure happened), we permanently leak those resources.

In other words, we first grab ALL Kurtosis resources, default them all to "destroy this resource", and then remove the ones that should be preserved. This guarantees that we process every single resource Kurtosis can create.

This algorithm is very important, because it guarantees that even if an enclave is in an inconsistent state (e.g. an error happened, or maybe the user messed with Docker), all the resources for the enclave can still get fully cleaned up.
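
A minimal sketch of this set-difference approach, with hypothetical names (`container`, `dockerRemoveContainer`, `cleanContainers`); the real logic lives in container-engine-lib and also covers volumes and networks:

```go
package main

import "fmt"

type container struct {
	id          string
	enclaveUUID string
}

// Stub standing in for the real Docker removal call.
func dockerRemoveContainer(id string) {
	fmt.Println("destroyed container", id)
}

// cleanContainers defaults EVERY Kurtosis container to "destroy", then spares
// only those belonging to enclaves that should remain. Half-created or
// inconsistent enclaves therefore still get cleaned, because nothing here
// depends on the enclave (or its network) being intact.
func cleanContainers(all []container, enclavesToKeep map[string]bool) {
	for _, c := range all {
		if enclavesToKeep[c.enclaveUUID] {
			continue // resource belongs to an enclave that should remain
		}
		dockerRemoveContainer(c.id)
	}
}

func main() {
	all := []container{
		{id: "logs-collector-1234", enclaveUUID: "half-created-enclave"},
		{id: "service-abcd", enclaveUUID: "keep-me"},
	}
	// `kurtosis clean -a` passes an empty keep-set: everything is destroyed.
	cleanContainers(all, map[string]bool{})
}
```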

I suspect that the logs-collector cleanup has not been written this way, and is instead written as, "first try to remove the enclave network and all the containers in it, then try to remove the logs-collector". This means that if a failure happens (e.g. because of Ctrl-C), the enclave network is lost but the logs-collector hasn't been cleaned up, as sketched below.
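
For contrast, a sketch of that fragile ordering (function names are hypothetical stubs, not the real API): because the logs collector is only reachable through its enclave network, losing the network first leaks it permanently.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stubs standing in for the real Docker calls.
func removeNetworkAndItsContainers(networkID string) error {
	fmt.Println("removed network", networkID)
	return errors.New("simulated Ctrl-C mid-clean")
}

func removeLogsCollectorFor(networkID string) error {
	fmt.Println("removed logs collector for", networkID)
	return nil
}

// fragileClean removes the enclave network first and only afterwards the
// logs collector. If the first step fails partway, the network may already
// be gone while the logs collector remains; since the collector is looked
// up VIA the network, it can never be cleaned again.
func fragileClean(networkID string) error {
	if err := removeNetworkAndItsContainers(networkID); err != nil {
		return err // the logs-collector removal below is never reached
	}
	return removeLogsCollectorFor(networkID)
}

func main() {
	if err := fragileClean("enclave-net-1234"); err != nil {
		fmt.Println("clean failed:", err)
	}
}
```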

What is the severity of this bug?

Critical; I am blocked and Kurtosis is unusable for me because of this bug.

What area of the product does this pertain to?

CLI: the Command Line Interface

@mieubrisse mieubrisse added the bug Something isn't working label Nov 19, 2023
@github-actions github-actions bot added cli For bugs relating to the CLI critical Critical bug or feature labels Nov 19, 2023
@tedim52 (Contributor) commented Nov 22, 2023

Confirming that the clean logic works as you described it should (at least on the surface). I verified that the log collector is one of the containers removed in the `destroyContainersInEnclaves` step, which runs before volumes and then the enclave networks are destroyed.

Going to continue debugging what may be causing the leaked logs collectors.

github-merge-queue bot pushed a commit that referenced this issue Nov 30, 2023
## Description:
Fixes: #1833

The resource leak was caused by a state where, if a failure occurred while creating an enclave (e.g. Ctrl-C while running `kurtosis
enclave add`) at a point where the log collector container had been created BEFORE it was connected to the network AND before the necessary
`defer undo`s to clean up the log collector had been queued, THEN the network would get cleaned up by the `defer undo` from `CreateNetwork`,
but the log collector container would remain.

Any attempt to do a `kurtosis clean -a` would then fail to clean the log collector container, because the network (and thus the enclave) the
container was created for had already been cleaned up.
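
A sketch of that failure window, in the `defer undo` style the description refers to (all names here are hypothetical, not the real codebase's):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stubs. createLogsCollectorContainer returns an ID alongside
// an error to model "the container exists, but the call was interrupted".
func createNetwork() (string, error) { return "enclave-net", nil }
func deleteNetwork(id string)        { fmt.Println("undo: deleted network", id) }
func createLogsCollectorContainer() (string, error) {
	return "logs-collector-1234", errors.New("simulated Ctrl-C")
}
func deleteContainer(id string) { fmt.Println("undo: deleted container", id) }

func createEnclaveOld() error {
	networkID, err := createNetwork()
	if err != nil {
		return err
	}
	undoNetwork := true
	defer func() {
		if undoNetwork {
			deleteNetwork(networkID) // network undo is queued here
		}
	}()

	// The container can already exist at this point even though its own
	// undo has NOT been queued yet; a failure in this window runs the
	// network undo above but leaves the container behind.
	collectorID, err := createLogsCollectorContainer()
	if err != nil {
		return err
	}
	undoCollector := true
	defer func() {
		if undoCollector {
			deleteContainer(collectorID)
		}
	}()

	// ... connect the collector to the network, etc. ...
	undoNetwork = false
	undoCollector = false
	return nil
}

func main() {
	if err := createEnclaveOld(); err != nil {
		fmt.Println("enclave creation failed:", err)
	}
}
```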

After digging, I realized the log collector was being created in
`container-engine-lib` as opposed to in the `engine`. Moving the logic to
the engine, AFTER the enclave is created, fixes the issue. If there is an
error at ANY point in the creation of the log collector container (even
if the log collector's `defer undo` hasn't been queued yet), the `defer undo`
from the `CreateEnclave` call will clean up the log collector, because it
has a label with the `enclaveUUID`.
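
A sketch of the fixed ordering under the same assumptions (names hypothetical): the enclave's own undo removes everything labeled with the enclave UUID, so a half-created collector is covered even before its own undo exists.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stubs.
func createEnclave(uuid string) error { return nil }
func destroyEverythingLabeled(uuid string) {
	fmt.Println("undo: removed all resources labeled enclaveUUID=" + uuid)
}
func createLogsCollector(uuid string) error {
	// The container is created WITH an enclaveUUID label, then fails.
	return errors.New("simulated Ctrl-C")
}

func createEnclaveFixed(enclaveUUID string) error {
	if err := createEnclave(enclaveUUID); err != nil {
		return err
	}
	undoEnclave := true
	defer func() {
		if undoEnclave {
			// Removes ALL resources carrying this enclave's label,
			// including a half-created logs collector.
			destroyEverythingLabeled(enclaveUUID)
		}
	}()

	// Created AFTER the enclave and labeled with enclaveUUID, so the
	// enclave undo above covers a failure at any point in here.
	if err := createLogsCollector(enclaveUUID); err != nil {
		return err
	}

	undoEnclave = false
	return nil
}

func main() {
	if err := createEnclaveFixed("a1b2c3"); err != nil {
		fmt.Println("enclave add failed:", err)
	}
}
```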

## Is this change user facing?
NO

## References:
#1833