Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/datadog] Test case level retries for flaky DD receiver integration test #24857

Closed

Conversation

zcross
Copy link
Contributor

@zcross zcross commented Aug 3, 2023

Description:
Hopefully reduces the probability (without guaranteeing anything, because test executions have a timeout of a few minutes and shouldn't spend a lot of time retrying) of this class of test flake:

E.g., https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/5752929108/job/15595049479

make[2]: Entering directory '/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/datadogreceiver'
go test -race -timeout 300s -parallel 4 --tags="" ./...
...
--- FAIL: TestDatadogServer (0.00s)
    receiver_test.go:47: 
        	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/datadogreceiver/receiver_test.go:47
        	Error:      	Received unexpected error:
        	            	failed to create datadog listener: listen tcp 127.0.0.1:8126: bind: address already in use
        	Test:       	TestDatadogServer
FAIL
FAIL	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/datadogreceiver	0.028s
?   	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/datadogreceiver/internal/metadata	[no test files]
FAIL

Link to tracking Issue: N/A

Testing: Tests pass locally with same -parallel N invocation used by CI

Documentation: N/A

@zcross zcross requested a review from a team August 3, 2023 21:12
@zcross zcross changed the title [receiver/datadog] Test case level retries for flaky-when-concurrent … [receiver/datadog] Test case level retries for flaky DD receiver integration test Aug 3, 2023
func startComponentWithRetries(ctx context.Context, component component.Component, host component.Host) error {
return backoff.Retry(func() error {
return component.Start(ctx, host)
}, backoff.WithMaxRetries(backoff.NewConstantBackOff(time.Second), 5))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainers: feel free to fiddle with the parameters here (e.g., feel free to edit on-PR yourself). I just wanted to point out the flake and propose a fix. I wouldn't want to use too many retries to slow down overall build times and mask an issue that should be fixed more structurally. However, I think that it makes sense for the integration test to use the expected standard port instead of a randomly generated (guaranteed-not-in-use) port.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, alternatively, we can just mash the two test cases (functions) in this suite (file) into one which will guarantee serial execution of their steps regardless of -parallel N invocations of gotest in CI.

@atoulme
Copy link
Contributor

atoulme commented Aug 4, 2023

@zcross please take a look at #24895 I have run into this problem last night and have devised a fix.

@zcross
Copy link
Contributor Author

zcross commented Aug 4, 2023

@zcross please take a look at #24895 I have run into this problem last night and have devised a fix.

much better

@zcross zcross closed this Aug 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants