Replace the usage of tianon/true images with docker compose up --wait #1720

Conversation

@pkoutsovasilis (Contributor) commented Mar 13, 2024

After having a look at the code I realised that the tianon/true images, as well as the respective docker compose *_is_ready services, are needed only for the detached mode of docker compose up. Specifically, through the service dependencies, docker compose detaches when all real services are reported as healthy and the *_is_ready ones have merely reached the Started state. So elastic-package stack up -d -v would look like this:

[Screenshot 2024-03-13 at 1:28:14 AM]

Notice that Done is printed while the *_is_ready services are still only Started.
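For context, each *_is_ready companion in the stack compose files is roughly of this shape (a sketch; the exact definitions in snapshot.yml may differ slightly):

  elasticsearch_is_ready:
    image: tianon/true
    depends_on:
      elasticsearch:
        condition: service_healthy

The tianon/true image simply exits immediately with status 0, so the companion service exists only to carry the depends_on healthcheck condition that gates when compose considers the stack up.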

This PR substitutes those services with the built-in docker compose --wait flag for detached mode, which results in the following:
[Screenshot 2024-03-13 at 1:41:20 AM]

Notice that Done is printed only when all services are reported Healthy.
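In detached mode the effective compose invocation becomes roughly the following (a sketch; the profile path and project name are taken from the debug logs elsewhere in this thread, and 600 is the wait timeout added by this PR):

docker compose -f ~/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up -d --wait --wait-timeout 600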

As for the attached mode of compose up, this essentially remains the same: as before, the invocation immediately transitions to the container logs.

Why should we try to avoid the tianon/true images?! Because many devs use Apple-silicon MacBooks, and this gets us a step closer to running natively without any ISA emulation. Assuming this PR makes it in, the only remaining requirement is for elastic-package-registry and fleet-server to offer arm64 images, and native execution here we go.

As always, I may have missed something in my logic above, so please feel free to correct me 🙂

This PR closes #1708 (PS: @mrodm I told you in EAH that I didn't open an issue because I wanted to open a PR, but joking aside, thanks for capturing this in an issue 😄)

@pkoutsovasilis pkoutsovasilis self-assigned this Mar 13, 2024
@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/remove_tiano_images branch from 4828815 to cc02bc7 on March 13, 2024 01:10
@mrodm mrodm requested a review from a team March 13, 2024 10:54
@mrodm (Contributor) left a comment

Thanks @pkoutsovasilis, this change looks interesting!

I added some comments about this; I see that there is at least one difference in behavior: the start-up process is not retried if the elastic-agent container fails, as it is in main (non-detached mode).

Another concern about this PR is that it would break the current support of elastic-package with docker-compose V1. Should this support be removed as part of the future 1.0 version? @jsoriano

The same is true for support from Docker itself (docker-compose release notes):

Effective July 2023, Compose V1 stopped receiving updates and is no longer in new Docker Desktop releases.


// pair each service with its <service>_is_ready companion
var allServices []string
for _, aService := range services {
	allServices = append(allServices, aService, fmt.Sprintf("%s_%s", aService, readyServicesSuffix))
}
Contributor:

If docker-compose V1 no longer needs to be supported and all the _is_ready containers can be removed, it looks like the readyServicesSuffix constant could be deleted.

If that is the case, the code of the Status function in internal/stack/status.go could also be simplified.
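If the *_is_ready services go away entirely, the loop above could collapse to something like this (just a sketch, assuming nothing else keys off readyServicesSuffix):

	// with no *_is_ready companions to add anymore, the service list can be copied as-is
	allServices := make([]string, 0, len(services))
	allServices = append(allServices, services...)

Or the services slice could simply be passed through directly.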


Contributor:

Those container definitions are part of the services required for testing (system tests).

Related to the status command, this should not affect it, since that command is intended to show information only about the containers related to the Stack (Kibana, Package Registry, Elasticsearch, etc.). It uses the docker-compose project from the profile:

containerIDs, err := docker.ContainerIDsWithLabel(projectLabelDockerCompose, DockerComposeProjectName(options.Profile))

About the test packages, I am not sure what would be the best option.
Currently, the servicedeployer runs in detached mode, using the -d docker-compose flag (the same line appears in three separate deployers):

ExtraArgs: []string{"--build", "-d"},

About the test packages, if servicedeployer is not updated with the new flags, those containers should be kept. An explicit method is also run to wait for the containers to become ready/healthy:

err = p.WaitForHealthy(ctx, opts)

@jsoriano Should servicedeployer be updated too (Up options) with these new flags? Or should the current implementation be kept? As they are running with -d, it looks safe.

About the test package, it could be removed... but in the integrations repository they would keep using that container. Probably, the tianon/true container could be kept to be sure that it is also tested with that. WDYT?
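If servicedeployer were updated to match, the change would presumably be just the extra flags on those invocations, something like this (a sketch; "600" mirrors the timeout used for the stack):

ExtraArgs: []string{"--build", "-d", "--wait", "--wait-timeout", "600"},

with the explicit WaitForHealthy call possibly kept as a safety net.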

internal/stack/serverless.go (outdated review thread, resolved)
@@ -111,7 +111,7 @@ func dockerComposeUp(options Options) error {

 	var args []string
 	if options.DaemonMode {
-		args = append(args, "-d")
+		args = append(args, "-d", "--wait", "--wait-timeout", fmt.Sprintf("%d", 600))
Contributor:

Testing on the main branch without using detached mode (-d), if the elastic-agent container fails... the docker-compose up is retried:

dependency failed to start: container elastic-package-stack-elastic-agent-1 exited (1)
2024/03/13 19:13:05 DEBUG output command: /usr/bin/docker ps -a --filter label=com.docker.compose.project=elastic-package-stack --format {{.ID}}
2024/03/13 19:13:05 DEBUG output command: /usr/bin/docker inspect d5df6764fb51 09f479d5637e 644fbbaa8557 341a68995249 831674f8c23f 920f6d104b80 25de6ef50f35 8cd0417a9622 afa7b068ae1b 44f10e51526b
Elastic Agent failed to start, trying again in 10s.
2024/03/13 19:13:16 DEBUG running command: /usr/bin/docker compose version --short
2024/03/13 19:13:16 DEBUG Determined Docker Compose version: 2.24.6
2024/03/13 19:13:16 DEBUG running command: /usr/bin/docker compose -f /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up
[+] Running 9/0

With this change, it looks like elastic-package does not get the error (elastic-agent failed to start) and therefore cannot retry the docker-compose up, leaving the elastic-agent container in exited status. It looks like the is_ready containers help in this case, though I am not sure how.

Contributor Author:

@mrodm this is weird, as the e2e tests in CI have caught a case where this happens. Can you help with an example that hopefully reproduces what you see?

Contributor:

this is weird, as the e2e tests in CI have caught a case where this happens.

Steps in CI use detached mode (-d):

elastic-package stack up -d -v

Can you help with an example that hopefully reproduces what you see?

Sure! Here I was referring to the case where that flag is not used:

elastic-package stack up -v

Not all runs of this command fail with this, so it needs to be repeated until that error is hit.

After some retries running the above command, the elastic-agent-1 container could not start and elastic-package did not try to restart the container as it would have done before:

elastic-agent-1     | {"log.level":"info","@timestamp":"2024-03-26T11:51:50.649Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":505},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server:8220/","ecs.version":"1.6.0"}
elastic-agent-1     | Error: fail to enroll: fail to execute request to fleet-server: EOF
elastic-agent-1     | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1     | Error: enrollment failed: exit status 1
elastic-agent-1     | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 exited with code 1

And the status of the cluster:

 $ elastic-package stack status
Status of Elastic stack services:
╭──────────────────┬─────────┬───────────────────╮
│ SERVICE          │ VERSION │ STATUS            │
├──────────────────┼─────────┼───────────────────┤
│ elastic-agent    │ 8.12.2  │ exited (1)        │
│ elasticsearch    │ 8.12.2  │ running (healthy) │
│ fleet-server     │ 8.12.2  │ running (healthy) │
│ kibana           │ 8.12.2  │ running (healthy) │
│ package-registry │ latest  │ running (healthy) │
╰──────────────────┴─────────┴───────────────────╯

Running with the same options on the latest published version, it does the retry:

elastic-agent-1              | {"log.level":"info","@timestamp":"2024-03-26T12:01:12.540Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":505},"message":"1st enrollment attempt failed, retrying for 10m0s, every 1m0s enrolling to URL: https://fleet-server:8220/","ecs.version":"1.6.0"}
elastic-agent-1              | Error: fail to enroll: fail to execute request to fleet-server: dial tcp 192.168.192.6:8220: connect: connection refused
elastic-agent-1              | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1              | Error: enrollment failed: exit status 1
elastic-agent-1              | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
elastic-agent-1 exited with code 1
dependency failed to start: container elastic-package-stack-elastic-agent-1 exited (1)
2024/03/26 13:01:12 DEBUG output command: /usr/bin/docker ps -a --filter label=com.docker.compose.project=elastic-package-stack --format {{.ID}}
2024/03/26 13:01:12 DEBUG output command: /usr/bin/docker inspect 0ebea2626c5e b5873a01c18c 94ec7181f20c 0449f8bf38ef a8c2d3eaf656 cc5633104400 1916c6b5f1dc 25baea050703 b1af9e72d4f2 2f268ae81ff3
Elastic Agent failed to start, trying again in 10s.
2024/03/26 13:01:22 DEBUG running command: /usr/bin/docker compose version --short
2024/03/26 13:01:22 DEBUG Determined Docker Compose version: 2.25.0
2024/03/26 13:01:22 DEBUG running command: /usr/bin/docker compose -f /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml -p elastic-package-stack up
WARN[0000] /home/mariorodriguez/.elastic-package/profiles/default/stack/snapshot.yml: `version` is obsolete 
[+] Running 9/0
 ✔ Container elastic-package-stack-package-registry-1           Running                                                                           0.0s 
 ✔ Container elastic-package-stack-elasticsearch-1              Running                                                                           0.0s 
 ✔ Container elastic-package-stack-package-registry_is_ready-1  Created                                                                           0.0s 
 ✔ Container elastic-package-stack-elasticsearch_is_ready-1     Created                                                                           0.0s 
 ✔ Container elastic-package-stack-kibana-1                     Running                                                                           0.0s 
 ✔ Container elastic-package-stack-kibana_is_ready-1            Created                                                                           0.0s 
 ✔ Container elastic-package-stack-fleet-server-1               Running                                                                           0.0s 
 ✔ Container elastic-package-stack-fleet-server_is_ready-1      Created                                                                           0.0s 
 ✔ Container elastic-package-stack-elastic-agent-1              Created                                                                           0.0s 
Attaching to elastic-agent-1, elastic-agent_is_ready-1, elasticsearch-1, elasticsearch_is_ready-1, fleet-server-1, fleet-server_is_ready-1, kibana-1, kibana_is_ready-1, package-registry-1, package-registry_is_ready-1

Contributor Author:

that's odd 🥲 hmmm I see, I think this is what happens: with the *_is_ready services in place we have dependencies on the actual services, so when elastic-agent fails the respective _is_ready service can't start because its dependency has failed completely, and thus docker compose up returns an error in the sense of "sorry, I couldn't bring up all the services". However, when we remove the *_is_ready services this isn't triggered, since nothing depends on elastic-agent, and thus up considers that everything has been brought up and the user will deal with any errors visible in the logs... I don't have an immediate fix for that.
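To illustrate the mechanism with a sketch (the exact condition used in snapshot.yml is an assumption here), the old layout contained something like:

  elastic-agent_is_ready:
    image: tianon/true
    depends_on:
      elastic-agent:
        condition: service_healthy

so an exited elastic-agent made that dependency fail and up returned the "dependency failed to start" error; without the companion service nothing depends on elastic-agent, so attached-mode up just streams the logs and exits cleanly.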

@jsoriano (Member):

Another concern about this PR is that it would break the current support of elastic-package with docker-compose V1. Should this support be removed as part of the future 1.0 version? @jsoriano

I think we can remove support for v1; it was completely discontinued last year and is no longer included in Docker Desktop distributions. The migration to v2 is straightforward, and should be transparent for anyone using a recent Docker version, since #1592.
We would only need to double check that we don't break anything in CI that might be using older versions. And we could add an explicit error message when we detect a v1 version.

No need to relate this change to #1158.
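A minimal sketch of such an explicit check (the helper name is hypothetical; the version string would come from docker compose version --short as already done for the debug logs, and fmt/strings are assumed to be imported):

	versionStr, err := getDockerComposeVersion(ctx) // hypothetical helper wrapping "docker compose version --short"
	if err != nil {
		return fmt.Errorf("failed to determine Docker Compose version: %w", err)
	}
	if strings.HasPrefix(versionStr, "1.") {
		return fmt.Errorf("Docker Compose v1 (%s) is no longer supported, please upgrade to Compose v2", versionStr)
	}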

@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/remove_tiano_images branch 2 times, most recently from 6533727 to 6c28d49 on March 26, 2024 12:16
@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/remove_tiano_images branch from 6c28d49 to 537ce34 on March 26, 2024 12:21
@elasticmachine (Collaborator)

💚 Build Succeeded

cc @pkoutsovasilis

@pkoutsovasilis (Contributor Author)

after speaking with @jsoriano we decided that it is better to introduce a milder change to support native arm64 execution; exploiting the docker compose --wait feature, as well as deprecating docker-compose v1, will be investigated further in the future
