Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Version 0.32.0 seems to ignore WithStartupTimeout #2633

Closed
KenjiTakahashi opened this issue Jul 9, 2024 · 12 comments
Closed

[Bug]: Version 0.32.0 seems to ignore WithStartupTimeout #2633

KenjiTakahashi opened this issue Jul 9, 2024 · 12 comments
Labels
bug An issue with the library

Comments

@KenjiTakahashi
Copy link

Testcontainers version

0.32.0

Using the latest Testcontainers version?

Yes

Host OS

Linux

Host arch

x86

Go version

1.22

Docker version

.

Docker info

.

What happened?

After updating to 0.32.0, we started getting errors like this:

can not create container: failed to start container: all exposed ports, [127.0.0.1:35533:35533/tcp], were not mapped in 5s: port 127.0.0.1:35533:35533/tcp is not mapped yet

Our setup code is like this:

req := testcontainers.ContainerRequest{
    Image:        "docker.elastic.co/elasticsearch/elasticsearch:8.14.2",
    ExposedPorts: []string{fmt.Sprintf("127.0.0.1:%[1]d:%[1]d/tcp", port)},
    WaitingFor: wait.ForHTTP("/_cluster/health").
        WithStartupTimeout(6 * time.Minute).
        WithPort(internalPort).
        WithResponseMatcher(func(body io.Reader) bool {
            response, err := io.ReadAll(body)
            if err != nil {
                return false
            }
            return strings.Contains(string(response), "green")
        }),
    AutoRemove: true,
}

container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
    ContainerRequest: req,
    Started:          true,
})
if err != nil {
    return nil, nil, fmt.Errorf("can not create container: %w", err)
}

We've tried changing the value of WithStartupTimeout, but the error always says 5s.

Version 0.31.0 still works as expected.

Relevant log output

No response

Additional information

No response

@KenjiTakahashi KenjiTakahashi added the bug An issue with the library label Jul 9, 2024
@mdelapenya
Copy link
Member

It's weird because the wait package has not been modified in this release, and looking at the commits I cannot see any one yet that could affect that part: v0.31.0...v0.32.0

Can you verify if this issue is present with the elasticsearch module too?

I'm in Gophercon, but will double check this later

Thanks for the report!

@andsleonardo
Copy link

I'm facing the same error after updating to v0.32.0.

perhaps it's coming from these changes made to lifecycle.go? there's a hardcoded max retry interval of 5 seconds there.

@EsterfanoLopes
Copy link

Yeap, having the same issue here

@mdelapenya
Copy link
Member

perhaps it's coming from these changes made to lifecycle.go

Indeed, I think you are right and that block is not honoring the wait strategies at all. I did forget the possibility of having other wait strategies with custom timeouts, so that fixed 5s one seems incorrect. Let's use this issue for ideas for the fix: I can do a patch release right after that for the fix

@andsleonardo
Copy link

andsleonardo commented Jul 11, 2024

@mdelapenya would it make sense to inject the container request, or its wait strategy, into that defaultReadinessHook and use the timeout/deadline details in the backoff/retry? 🤔

@stevenh
Copy link
Collaborator

stevenh commented Aug 8, 2024

Should it be just the default if dockerContainer.WaitingFor is nil?

This was referenced Aug 9, 2024
@mdelapenya
Copy link
Member

Hi @KenjiTakahashi thanks for opening this issue. I'm trying to reproduce it and, before hitting the bug, I have a few questions:

  • you are using a fixed port for the elasticsearch's 9200 well-known port. Is there a reason to not used the random ports that testcontainers-go offers?
  • in the wait strategy, why checking for an internal port? what's the value in your code? Also I think there is no need to wait for 6 minutes for cluster healthiness.
  • using AutoRemove is deprecated. I'd prefer relying on Ryuk to prune the running containers.

When I remove all of them in a local test, I get elasticsearch is failing with the following error:

{"@timestamp":"2024-08-21T09:06:54.732Z", "log.level":"ERROR", "message":"node validation exception\n[1] bootstrap checks failed. You must address the points described in the following [1] lines before starting Elasticsearch. For more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.14/bootstrap-checks.html]\nbootstrap check failure [1] of [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]; for more information see [https://www.elastic.co/guide/en/elasticsearch/reference/8.14/_maximum_map_count_check.html]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.bootstrap.Elasticsearch","elasticsearch.node.name":"c8499dc4c8f9","elasticsearch.cluster.name":"docker-cluster"}
ERROR: Elasticsearch did not exit normally - check the logs at /usr/share/elasticsearch/logs/docker-cluster.log
{"@timestamp":"2024-08-21T09:06:54.737Z", "log.level": "INFO", "message":"stopping ...", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.node.name":"c8499dc4c8f9","elasticsearch.cluster.name":"docker-cluster"}
{"@timestamp":"2024-08-21T09:06:54.763Z", "log.level": "INFO", "message":"stopped", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.node.name":"c8499dc4c8f9","elasticsearch.cluster.name":"docker-cluster"}
{"@timestamp":"2024-08-21T09:06:54.763Z", "log.level": "INFO", "message":"closing ...", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.node.name":"c8499dc4c8f9","elasticsearch.cluster.name":"docker-cluster"}
{"@timestamp":"2024-08-21T09:06:54.777Z", "log.level": "INFO", "message":"closed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.node.name":"c8499dc4c8f9","elasticsearch.cluster.name":"docker-cluster"}
{"@timestamp":"2024-08-21T09:06:54.780Z", "log.level": "INFO", "message":"Native controller process has stopped - no new native processes can be started", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"ml-cpp-log-tail-thread","log.logger":"org.elasticsearch.xpack.ml.process.NativeController","elasticsearch.node.name":"c8499dc4c8f9","elasticsearch.cluster.name":"docker-cluster"}

ERROR: Elasticsearch died while starting up, with exit code 78

I verified that the configuration of the container is incorrect, please compare with the current Elasticsearch module we have: https://github.com/testcontainers/testcontainers-go/blob/main/modules/elasticsearch/elasticsearch.go#L43-L67

Is there a reason to not use it? It's already been tuned up by the Elastic folks. The code would just like this:

container, err := elasticsearch.Run(ctx, "docker.elastic.co/elasticsearch/elasticsearch:8.14.2")
if err != nil {
	t.Fatalf("Could not start container: %s", err)
}

Let's address that first, and then we can move on with the startup timeout

@KenjiTakahashi
Copy link
Author

Well, our internal CI is kinda slow and is effectively a "Docker-in-Docker-in-Docker". Things like RYUK tend to not play well in such setup (I'd argue that it rarely plays well in general, but that's unrelated) and the timeouts have to be long to ensure success (maybe not 6m long, but 🤷).

We do not listen on 9200, we do listen on whatever the internalPort var is pointing to, which is some port that the kernel told us is free to use. It is set up like this for unrelated reasons.

I have left out some of the env config for brevity, but here is the actual one:

Env: map[string]string{
    "discovery.type":                  "single-node",
    "http.publish_host":               "_local_",
    "http.port":                       strconv.Itoa(port),
    "xpack.security.enabled":          "false",
    "xpack.security.http.ssl.enabled": "false",
    "ES_JAVA_OPTS":                    "-Xmx256m",
}

For the error you're getting, did you get a chance to read it? It complains about the value of vm.max_map_count in your kernel being too low. Nothing to do with container configuration.

Sorry, but none of this sounds related to the issue at hand: That WithStartupTimeout stopped doing what it was doing.

@mdelapenya
Copy link
Member

@KenjiTakahashi I'm sorry if my comment caused confusion. I was trying to prune the errors I found locally while debugging the repro code snippet in this issue, which is what I usually do with all the PRs and issues we receive. So it could be the case I missed the error here. I'd appreciate completeness in the repro code so I can try to reproduce it with more precision 🙏

Going back to the issue, #2691 #2718 could be related to a potential fix, so I'd suggest bumping to the latest release, v0.33.0 if possible. The suggestions in #2633 (comment) could also help. I wonder if you would have time for contributing them.

I'd like to also understand the reasons and use case for you to use a fixed port for the well-known port of Elasticsearch (9200/tcp ?), as Testcontainers advocates against it. I'm more curious about the use case where the internal port is being obtained as a free port from the OS before hand, and used it as part of the same wait strategy for the cluster health. The current Elasticsearch module does it already for you.

@zregvart
Copy link
Contributor

We had this error reported with updating to 0.32.0, updating to 0.33.0 fixed the issue for us.

@stevenh
Copy link
Collaborator

stevenh commented Oct 4, 2024

I'm going to close this based on the previous post, if there is still an issue there please reply and we can re-open

@stevenh stevenh closed this as completed Oct 4, 2024
@KenjiTakahashi
Copy link
Author

Sorry for the delay! We've checked the latest version and it works fine now, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue with the library
Projects
None yet
Development

No branches or pull requests

6 participants