Replace the usage of tianon/true images with docker compose up --wait #1720

Conversation

@pkoutsovasilis (Contributor) commented Mar 13, 2024

After having a look at the code I realised that the tianon/true images, as well as the respective docker compose *_is_ready services, are needed only for the detached mode of docker compose up. Specifically, through the service dependencies, docker compose detaches when all real services are reported as healthy and the *_is_ready ones have merely reached the Started state. So elastic-package stack up -d -v would look like this:

[Screenshot 2024-03-13 at 1:28:14 AM]

Notice that Done is printed while the *_is_ready services are still only Started.
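For context, each *_is_ready companion in the stack compose files is roughly of this shape (a sketch; the exact definitions in snapshot.yml may differ slightly):

  elasticsearch_is_ready:
    image: tianon/true
    depends_on:
      elasticsearch:
        condition: service_healthy

The tianon/true image simply exits immediately with status 0, so the companion service exists only to carry the depends_on healthcheck condition that gates when compose considers the stack up.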

This PR substitutes those services with the built-in docker compose --wait flag for detached mode, which results in the following:
[Screenshot 2024-03-13 at 1:41:20 AM]

Notice that Done is printed only when all services are reported Healthy.
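In detached mode the effective compose invocation becomes roughly the following (a sketch; the profile path and project name are taken from the debug logs elsewhere in this thread, and 600 is the wait timeout added by this PR):

docker compose -f ~/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up -d --wait --wait-timeout 600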

As for the attached mode of compose up, this essentially remains the same: as before, the invocation immediately transitions to the container logs.

Why should we try to avoid the tianon/true images?! Because many devs use Apple-silicon MacBooks, and this gets us a step closer to running natively without any ISA emulation. Assuming this PR makes it in, the only remaining requirement is for elastic-package-registry and fleet-server to offer arm64 images, and native execution here we go.

As always, I may have missed something in my logic above, so please feel free to correct me 🙂

This PR closes #1708 (PS: @mrodm I told you in EAH that I didn't open an issue because I wanted to open a PR, but joking aside, thanks for capturing this in an issue 😄)

@pkoutsovasilis pkoutsovasilis self-assigned this Mar 13, 2024
@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/remove_tiano_images branch from 4828815 to cc02bc7 on March 13, 2024 01:10
@mrodm mrodm requested a review from a team March 13, 2024 10:54
@mrodm (Contributor) left a comment

Thanks @pkoutsovasilis, this change looks interesting!

I added some comments about this; I see that there is at least one difference in behavior: the start-up process is not retried if the elastic-agent container fails, as it is in main (non-detached mode).

Another concern about this PR is that it would break the current support of elastic-package with docker-compose V1. Should this support be removed as part of the future 1.0 version? @jsoriano

The same is true for support from Docker itself (docker-compose release notes):

Effective July 2023, Compose V1 stopped receiving updates and is no longer in new Docker Desktop releases.


// pair each service with its <service>_is_ready companion
var allServices []string
for _, aService := range services {
	allServices = append(allServices, aService, fmt.Sprintf("%s_%s", aService, readyServicesSuffix))
}
Contributor:

If docker-compose V1 no longer needs to be supported and all the _is_ready containers can be removed, it looks like the readyServicesSuffix constant could be deleted.

If that is the case, the code of the Status function in internal/stack/status.go could also be simplified.
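If the *_is_ready services go away entirely, the loop above could collapse to something like this (just a sketch, assuming nothing else keys off readyServicesSuffix):

	// with no *_is_ready companions to add anymore, the service list can be copied as-is
	allServices := make([]string, 0, len(services))
	allServices = append(allServices, services...)

Or the services slice could simply be passed through directly.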


Contributor:

Those container definitions are part of the services required for testing (system tests).

Related to the status command, this should not affect it, since that command is intended to show information only about the containers related to the Stack (Kibana, Package Registry, Elasticsearch, etc.). It uses the docker-compose project from the profile:

containerIDs, err := docker.ContainerIDsWithLabel(projectLabelDockerCompose, DockerComposeProjectName(options.Profile))

About the test packages, I am not sure what would be the best option.
Currently, the servicedeployer runs in detached mode, using the -d docker-compose flag (the same line appears in three separate deployers):

ExtraArgs: []string{"--build", "-d"},

About the test packages, if servicedeployer is not updated with the new flags, those containers should be kept. An explicit method is also run to wait for the containers to become ready/healthy:

err = p.WaitForHealthy(ctx, opts)

@jsoriano Should servicedeployer be updated too (Up options) with these new flags? Or should the current implementation be kept? As they are running with -d, it looks safe.

About the test package, it could be removed... but in the integrations repository they would keep using that container. Probably, the tianon/true container could be kept to be sure that it is also tested with that. WDYT?
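If servicedeployer were updated to match, the change would presumably be just the extra flags on those invocations, something like this (a sketch; "600" mirrors the timeout used for the stack):

ExtraArgs: []string{"--build", "-d", "--wait", "--wait-timeout", "600"},

with the explicit WaitForHealthy call possibly kept as a safety net.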

internal/stack/serverless.go (outdated review thread, resolved)
@@ -111,7 +111,7 @@ func dockerComposeUp(options Options) error {

 	var args []string
 	if options.DaemonMode {
-		args = append(args, "-d")
+		args = append(args, "-d", "--wait", "--wait-timeout", fmt.Sprintf("%d", 600))
Contributor:

Testing on the main branch without using detached mode (-d), if the elastic-agent container fails... the docker-compose up is retried:

dependency failed to start: container elastic-package-stack-elastic-agent-1 exited (1)
2024/03/13 19:13:05 DEBUG output command: /usr/bin/docker ps -a --filter label=com.docker.compose.project=elastic-package-stack --format {{.ID}}
2024/03/13 19:13:05 DEBUG output command: /usr/bin/docker inspect d5df6764fb51 09f479d5637e 644fbbaa8557 341a68995249 831674f8c23f 920f6d104b80 25de6ef50f35 8cd0417a9622 afa7b068ae1b 44f10e51526b
Elastic Agent failed to start, trying again in 10s.
2024/03/13 19:13:16 DEBUG running command: /usr/bin/docker compose version --short
2024/03/13 19:13:16 DEBUG Determined Docker Compose version: 2.24.6
2024/03/13 19:13:16 DEBUG running command: /usr/bin/docker compose -f /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up
[+] Running 9/0

With this change, it looks like elastic-package does not get the error (elastic-agent failed to start) and therefore cannot retry the docker-compose up, leaving the elastic-agent container in exited status. It looks like the is_ready containers help in this case, though I am not sure how.

Contributor Author:

@mrodm this is weird, as the e2e tests in CI have caught a case where this happens. Can you help with an example that hopefully reproduces what you see?

Contributor:

this is weird, as the e2e tests in CI have caught a case where this happens.

Steps in CI use detached mode (-d):

elastic-package stack up -d -v

Can you help with an example that hopefully reproduces what you see?

Sure! Here I was referring to the case where that flag is not used:

elastic-package stack up -v

Not all runs of this command fail with this, so it needs to be repeated until that error is hit.

After some retries running the above command, the elastic-agent-1 container could not start and elastic-package did not try to restart the container as it would have done before:

elastic-agent-1     | {"log.level":"info","@timestamp":"2024-03-26T11:51:50.649Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":505},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server:8220/","ecs.version":"1.6.0"}
elastic-agent-1     | Error: fail to enroll: fail to execute request to fleet-server: EOF
elastic-agent-1     | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1     | Error: enrollment failed: exit status 1
elastic-agent-1     | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 exited with code 1

And the status of the cluster:

 $ elastic-package stack status
Status of Elastic stack services:
╭──────────────────┬─────────┬───────────────────╮
│ SERVICE          │ VERSION │ STATUS            │
├──────────────────┼─────────┼───────────────────┤
│ elastic-agent    │ 8.12.2  │ exited (1)        │
│ elasticsearch    │ 8.12.2  │ running (healthy) │
│ fleet-server     │ 8.12.2  │ running (healthy) │
│ kibana           │ 8.12.2  │ running (healthy) │
│ package-registry │ latest  │ running (healthy) │
╰──────────────────┴─────────┴───────────────────╯

Running with the same options on the latest published version, it does the retry:

elastic-agent-1              | {"log.level":"info","@timestamp":"2024-03-26T12:01:12.540Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":505},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server:8220/","ecs.version":"1.6.0"}
elastic-agent-1              | Error: fail to enroll: fail to execute request to fleet-server: dial tcp 192.168.192.6:8220: connect: connection refused
elastic-agent-1              | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1              | Error: enrollment failed: exit status 1
elastic-agent-1              | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 exited with code 1
dependency failed to start: container elastic-package-stack-elastic-agent-1 exited (1)
2024/03/26 13:01:12 DEBUG output command: /usr/bin/docker ps -a --filter label=com.docker.compose.project=elastic-package-stack --format {{.ID}}
2024/03/26 13:01:12 DEBUG output command: /usr/bin/docker inspect 0ebea2626c5e b5873a01c18c 94ec7181f20c 0449f8bf38ef a8c2d3eaf656 cc5633104400 1916c6b5f1dc 25baea050703 b1af9e72d4f2 2f268ae81ff3
Elastic Agent failed to start, trying again in 10s.
2024/03/26 13:01:22 DEBUG running command: /usr/bin/docker compose version --short
2024/03/26 13:01:22 DEBUG Determined Docker Compose version: 2.25.0
2024/03/26 13:01:22 DEBUG running command: /usr/bin/docker compose -f /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up
WARN[0000] /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml: `version` is obsolete 
[+] Running 9/0
 ✔ Container elastic-package-stack-package-registry-1           Running                                                                           0.0s 
 ✔ Container elastic-package-stack-elasticsearch-1              Running                                                                           0.0s 
 ✔ Container elastic-package-stack-package-registry_is_ready-1  Created                                                                           0.0s 
 ✔ Container elastic-package-stack-elasticsearch_is_ready-1     Created                                                                           0.0s 
 ✔ Container elastic-package-stack-kibana-1                     Running                                                                           0.0s 
 ✔ Container elastic-package-stack-kibana_is_ready-1            Created                                                                           0.0s 
 ✔ Container elastic-package-stack-fleet-server-1               Running                                                                           0.0s 
 ✔ Container elastic-package-stack-fleet-server_is_ready-1      Created                                                                           0.0s 
 ✔ Container elastic-package-stack-elastic-agent-1              Created                                                                           0.0s 
Attaching to elastic-agent-1, elastic-agent_is_ready-1, elasticsearch-1, elasticsearch_is_ready-1, fleet-server-1, fleet-server_is_ready-1, kibana-1, kibana_is_ready-1, package-registry-1, package-registry_is_ready-1

Contributor Author:

that's odd 🥲 hmmm I see, I think this is what happens: with the *_is_ready services in place we have dependencies on the actual services, so when elastic-agent fails the respective _is_ready service can't start because its dependency has failed completely, and thus docker compose up returns an error in the sense of "sorry, I couldn't bring up all the services". However, when we remove the *_is_ready services this isn't triggered, since nothing depends on elastic-agent, and thus up considers that everything has been brought up and the user will deal with any errors visible in the logs... I don't have an immediate fix for that.
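To illustrate the mechanism with a sketch (the exact condition used in snapshot.yml is an assumption here), the old layout contained something like:

  elastic-agent_is_ready:
    image: tianon/true
    depends_on:
      elastic-agent:
        condition: service_healthy

so an exited elastic-agent made that dependency fail and up returned the "dependency failed to start" error; without the companion service nothing depends on elastic-agent, so attached-mode up just streams the logs and exits cleanly.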

@jsoriano (Member):

Another concern about this PR is that it would break the current support of elastic-package with docker-compose V1. Should this support be removed as part of the future 1.0 version? @jsoriano

I think we can remove support for v1; it was completely discontinued last year and is no longer included in Docker Desktop distributions. The migration to v2 is straightforward, and should be transparent for anyone using a recent Docker version, since #1592.
We would only need to double check that we don't break anything in CI that might be using older versions. And we could add an explicit error message when we detect a v1 version.

No need to relate this change to #1158.
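A minimal sketch of such an explicit check (the helper name is hypothetical; the version string would come from docker compose version --short as already done for the debug logs, and fmt/strings are assumed to be imported):

	versionStr, err := getDockerComposeVersion(ctx) // hypothetical helper wrapping "docker compose version --short"
	if err != nil {
		return fmt.Errorf("failed to determine Docker Compose version: %w", err)
	}
	if strings.HasPrefix(versionStr, "1.") {
		return fmt.Errorf("Docker Compose v1 (%s) is no longer supported, please upgrade to Compose v2", versionStr)
	}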

@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/remove_tiano_images branch 2 times, most recently from 6533727 to 6c28d49 on March 26, 2024 12:16
@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/remove_tiano_images branch from 6c28d49 to 537ce34 on March 26, 2024 12:21
@elasticmachine (Collaborator)

💚 Build Succeeded

cc @pkoutsovasilis

@pkoutsovasilis (Contributor Author)

after speaking with @jsoriano we decided that it is better to introduce a milder change to support native arm64 execution; exploiting the docker compose --wait feature, as well as deprecating docker-compose v1, will be investigated further in the future
