Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaky Test]: TestOtelFileProcessing – there should be exported logs by now #4617

Closed
rdner opened this issue Apr 24, 2024 · 9 comments · Fixed by #4638
Closed

[Flaky Test]: TestOtelFileProcessing – there should be exported logs by now #4617

rdner opened this issue Apr 24, 2024 · 9 comments · Fixed by #4638
Assignees
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@rdner
Copy link
Member

rdner commented Apr 24, 2024

Failing test case

TestOtelFileProcessing

Error message

there should be exported logs by now

Build

https://buildkite.com/elastic/elastic-agent/builds/8506#018f0d68-b5e4-46e6-93a8-91612f46c4e8

OS

Linux

Stacktrace and notes

=== RUN   TestOtelFileProcessing
    fetcher.go:95: Using existing artifact elastic-agent-8.14.0-SNAPSHOT-linux-x86_64.tar.gz
    fixture.go:281: Extracting artifact elastic-agent-8.14.0-SNAPSHOT-linux-x86_64.tar.gz to /tmp/TestOtelFileProcessing3713650306/002
    fixture.go:299: Completed extraction of artifact elastic-agent-8.14.0-SNAPSHOT-linux-x86_64.tar.gz to /tmp/TestOtelFileProcessing3713650306/002
    fixture.go:957: All component specifications where removed
    fixture.go:998: Placed component specifications: fake, fake-shipper
    log.go:85: Starting in otel mode
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.019Z","log.origin":{"file.name":"cmd/run.go","file.line":180},"message":"Elastic Agent started","log":{"source":"elastic-agent"},"process.pid":18312,"agent.version":"8.14.0","agent.unprivileged":true,"ecs.version":"1.6.0"}
    log.go:85: 2024-04-24T00:32:50.028Z	info	[email protected]/service.go:99	Setting up own telemetry...
    log.go:85: 2024-04-24T00:32:50.028Z	info	[email protected]/telemetry.go:103	Serving metrics	{"address": ":8888", "level": "Normal"}
    log.go:85: 2024-04-24T00:32:50.034Z	info	[email protected]/service.go:166	Starting /tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/elastic-agent...	{"Version": "8.14.0", "NumCPU": 2}
    log.go:85: 2024-04-24T00:32:50.034Z	info	extensions/extensions.go:34	Starting extensions...
    fixture.go:656: >> running binary with: [/tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/elastic-agent otel validate --config /tmp/TestOtelFileProcessing3713650306/001/otel-valid.yml --feature-gates foo.bar]
    log.go:85: 2024-04-24T00:32:50.037Z	info	adapter/receiver.go:45	Starting stanza receiver	{"kind": "receiver", "name": "filelog", "data_type": "logs"}
    log.go:85: 2024-04-24T00:32:50.039Z	info	[email protected]/service.go:192	Everything is ready. Begin running and processing data.
    log.go:85: 2024-04-24T00:32:50.040Z	warn	localhostgate/featuregate.go:63	The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.	{"feature gate ID": "component.UseLocalHostAsDefaultHost"}
    log.go:85: 2024-04-24T00:32:50.040Z	error	[email protected]/collector.go:278	Asynchronous error received, terminating process	{"error": "listen tcp :8888: bind: address already in use"}
    log.go:85: go.opentelemetry.io/collector/otelcol.(*Collector).Run
    log.go:85: /go/pkg/mod/go.opentelemetry.io/collector/[email protected]/collector.go:278
    log.go:85: github.com/elastic/elastic-agent/internal/pkg/otel.Run
    log.go:85: /go/src/github.com/elastic/elastic-agent/internal/pkg/otel/run.go:47
    log.go:85: github.com/elastic/elastic-agent/internal/pkg/agent/cmd.runCollector.func3
    log.go:85: /go/src/github.com/elastic/elastic-agent/internal/pkg/agent/cmd/otel.go:95
    log.go:85: 2024-04-24T00:32:50.040Z	info	[email protected]/service.go:229	Starting shutdown...
    log.go:85: 2024-04-24T00:32:50.040Z	info	adapter/receiver.go:140	Stopping stanza receiver	{"kind": "receiver", "name": "filelog", "data_type": "logs"}
    log.go:85: 2024-04-24T00:32:50.055Z	info	extensions/extensions.go:59	Stopping extensions...
    log.go:85: 2024-04-24T00:32:50.055Z	info	[email protected]/service.go:243	Shutdown complete.
    fixture.go:656: >> running binary with: [/tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/elastic-agent otel validate --config /tmp/TestOtelFileProcessing3713650306/001/otel-invalid.yml]
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.317Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":156},"message":"Starting upgrade watcher","log":{"source":"elastic-agent"},"path":"/tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/elastic-agent","args":["/tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/elastic-agent","watch","--path.config","/tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64","--path.home","/tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64"],"env":[],"dir":"","ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.318Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":163},"message":"Upgrade Watcher invoked","log":{"source":"elastic-agent"},"agent.upgrade.watcher.process.pid":18338,"agent.process.pid":18312,"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.318Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":151},"message":"releasing watcher 18338","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.318Z","log.origin":{"file.name":"cmd/run.go","file.line":264},"message":"APM instrumentation disabled","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.319Z","log.origin":{"file.name":"application/application.go","file.line":65},"message":"Gathered system information","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.323Z","log.origin":{"file.name":"application/application.go","file.line":71},"message":"Detected available inputs and outputs","log":{"source":"elastic-agent"},"inputs":["fake","fake-apm","fake-isolated-units"],"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.323Z","log.origin":{"file.name":"capabilities/capabilities.go","file.line":48},"message":"Capabilities file not found in /tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/capabilities.yml","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.324Z","log.origin":{"file.name":"application/application.go","file.line":77},"message":"Determined allowed capabilities","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.324Z","log.origin":{"file.name":"application/application.go","file.line":92},"message":"Loading baseline config from /tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/elastic-agent.yml","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.836Z","log.logger":"control","log.origin":{"file.name":"server/server.go","file.line":88},"message":"GRPC control socket listening at unix:///tmp/elastic-agent/pgu_7Fua0UpuvurzhelwZU8EgJd4wd5X.sock","log":{"source":"elastic-agent"},"address":"unix:///tmp/elastic-agent/pgu_7Fua0UpuvurzhelwZU8EgJd4wd5X.sock","ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.838Z","log.origin":{"file.name":"runtime/manager.go","file.line":213},"message":"Starting grpc control protocol listener on port 6789 with max_message_size 104857600","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.839Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":123},"message":"Source URI changed from \"https://artifacts.elastic.co/downloads/\" to \"https://artifacts.elastic.co/downloads/\"","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.839Z","log.origin":{"file.name":"reload/reload.go","file.line":49},"message":"Starting server","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.840Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":85},"message":"Starting stats endpoint","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.840Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":87},"message":"Metrics endpoint listening on: /tmp/elastic-agent/d7fe3e1b48cc0f2dafbf6995409671d83965889717bfb843c470966c984184e4.sock (configured: unix:///tmp/elastic-agent/d7fe3e1b48cc0f2dafbf6995409671d83965889717bfb843c470966c984184e4.sock)","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.840Z","log.logger":"composable.providers.docker","log.origin":{"file.name":"docker/docker.go","file.line":44},"message":"Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:50.942Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":1181},"message":"Updating running component model","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:51.142Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":1382},"message":"component model updated","log":{"source":"elastic-agent"},"changes":{"components":{},"outputs":{}},"ecs.version":"1.6.0"}
    log.go:96: {"log.level":"info","@timestamp":"2024-04-24T00:32:51.142Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":1181},"message":"Updating running component model","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
    fixture.go:656: >> running binary with: [/tmp/TestOtelFileProcessing3713650306/002/elastic-agent-8.14.0-SNAPSHOT-linux-x86_64/elastic-agent status --output json]
    otel_test.go:170: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/otel_test.go:170
        	Error:      	Condition never satisfied
        	Test:       	TestOtelFileProcessing
        	Messages:   	there should be exported logs by now
--- FAIL: TestOtelFileProcessing (191.29s)
@rdner rdner added Team:Elastic-Agent Label for the Agent team flaky-test Unstable or unreliable test cases. labels Apr 24, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@rdner
Copy link
Member Author

rdner commented Apr 25, 2024

@michalpristas
Copy link
Contributor

{"error": "listen tcp :8888: bind: address already in use"}

probably valid error, setup changed a bit with update to 0.99

@rdner
Copy link
Member Author

rdner commented Apr 29, 2024

@rdner
Copy link
Member Author

rdner commented Apr 30, 2024

@michalpristas
Copy link
Contributor

related to this issue here: open-telemetry/opentelemetry-collector#10031
cause by pull open-telemetry/opentelemetry-collector#9257

@ycombinator
Copy link
Contributor

ycombinator commented Apr 30, 2024

Are we sure this test is failing due to open-telemetry/opentelemetry-collector#10031? I just built Agent from main, created the same OTel configuration files that's are being used by this test, and manually ran elastic-agent otel validate against those configuration files as expected.

$ ./elastic-agent version --binary-only
Binary: 8.15.0-SNAPSHOT (build: 3b0214f1e7e16124e7b1c5689e65632f749eedf4 at 2024-04-30 11:32:31 +0000 UTC)
$ cat /tmp/otel-processing.yaml
receivers:
  filelog:
    include: [ "/var/log/system.log", "/var/log/syslog"  ]
    start_at: beginning

exporters:
  file:
    path: /tmp/testfileprocessing.json

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters:
        - file
$ ./elastic-agent otel validate --config /tmp/otel-processing.yaml
2024-04-30T04:37:20.272-0700	info	[email protected]/service.go:99	Setting up own telemetry...
2024-04-30T04:37:20.272-0700	info	[email protected]/telemetry.go:103	Serving metrics	{"address": ":8888", "level": "Normal"}
$ echo $?
0
$ cat /tmp/otel-invalid.yaml
receivers:
  filelog:
    include: [ "/var/log/system.log", "/var/log/syslog"  ]
    start_at: beginning

exporters:
  file:
    path: /tmp/testfileprocessing.json

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [nonexistingprocessor]
      exporters:
        - file
$ ./elastic-agent otel validate --config /tmp/otel-invalid.yaml
service::pipelines::logs: references processor "nonexistingprocessor" which is not configured
$ echo $?
1

@michalpristas
Copy link
Contributor

michalpristas commented Apr 30, 2024

it works locally on my end as well. my take on this is that machines on CIs are slower and it takes a bit of time to release bound port. in the meantime another process is starting. we've seen something like this before.

it started appearing just after i upgraded versions from 97

i also needed to change check for validate because in version 0.97 for valid config output is empty. for 97+ it shows these 3 lines you shared as well

@ycombinator
Copy link
Contributor

ycombinator commented Apr 30, 2024

my take on this is that machines on CIs are slower and it takes a bit of time to release bound port. in the meantime another process is starting. we've seen something like this before.

Thanks, I was also thinking it must be something along these lines, unfortunately 😞.

Given that, it would be tempting to add a bit of a sleep or something in the test between the two validate calls but I don't think that's the right solution, at least not for now. Since we aren't really depending on anything specific from OTel versions > 0.97 at the moment, pinning the OTel dependencies to 0.97, as you have done in #4638, is the right solution, at least for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants