TestIntegrations flakiness #47156

Closed
GavinFrazar opened this issue Oct 3, 2024 · 6 comments · Fixed by #49850

@GavinFrazar
Contributor

GavinFrazar commented Oct 3, 2024

Failure

Link(s) to logs

Relevant snippet

    integration_test.go:4867: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4867
        	Error:      	Received unexpected error:
        	            	failed connecting to host localhost:24990: failed to receive cluster details response
        	            		failed to dial target host
        	            		direct dialing to nodes not found in inventory is not supported
        	Test:       	TestIntegrations/ProxyHostKeyCheck/Disabled

I'm making this issue to cover multiple subtests of TestIntegrations that are all flaky for the same inventory error reason:

@GavinFrazar
Contributor Author

@zmb3 who would be best to assign this since Alex is no longer with Teleport?

@GavinFrazar
Contributor Author

another subtest: https://github.com/gravitational/teleport/actions/runs/11169042992/job/31048876887?pr=46747

        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:3237
        	            				/__w/teleport/teleport/integration/integration_test.go:2956
        	            				/__w/teleport/teleport/integration/integration_test.go:130
        	Error:      	Received unexpected error:
        	            	failed connecting to host 127.0.0.1:34095: failed to receive cluster details response
        	            		failed to dial target host
        	            		direct dialing to nodes not found in inventory is not supported
        	Test:       	TestIntegrations/TrustedClusters
    --- FAIL: TestIntegrations/TrustedClusters (30.99s)

@ravicious
Member

https://github.com/gravitational/teleport/actions/runs/11480984785/job/31950631855#step:6:1976

     integration_test.go:4867: 
        	Error Trace:	/__w/teleport/teleport/integration/integration_test.go:4867
        	Error:      	Received unexpected error:
        	            	failed connecting to host localhost:24990: failed to receive cluster details response
        	            		failed to dial target host
        	            		direct dialing to nodes not found in inventory is not supported
        	Test:       	TestIntegrations/ProxyHostKeyCheck/Disabled

rosstimothy self-assigned this Oct 24, 2024
@hugoShaka
Contributor

TestIntegrations/TrustedClustersWithLabels failed with

failed connecting to host 127.0.0.1:40111: failed to receive cluster details response
    failed to dial target host
    connection error: desc = "transport: Error while dialing: failed to dial: cluster cluster-aux is offline"

Source: https://github.com/gravitational/teleport/actions/runs/11686293079/job/32541747209

The root cause seems to be the same as the TestIntegrations/TrustedClusters failure described above. I can file a new one if you think this is useful.

@rosstimothy
Contributor

These failures seem to stem from the fact that we wait for the nodes to be visible to Auth, but not for them to propagate to the Proxy cache. As a result, the check that waits for nodes passes, but when the Proxy processes the dial request it still doesn't know about the target node.

rosstimothy added a commit that referenced this issue Dec 5, 2024
Closes #47156.

All of the tests suffering from issues dialing hosts and failing
with a `failed to dial target host` error were incorrectly waiting
for nodes to become visible before establishing connections. The
main culprit for most of the failures was `waitForNodesToRegister`,
though a few tests had a very similar hand-rolled variant, which
incorrectly returned when the nodes appeared in Auth. However, since
the Proxy is the one performing dialing, they should have waited
for the nodes to appear in the Proxy cache.

To resolve, `waitForNodesToRegister` and all hand-rolled equivalents
have been removed in favor of `helpers.WaitForNodeCount`, which
correctly uses the `CachingAccessPoint` of the RemoteSite instead
of `GetClient`.
rosstimothy added a commit that referenced this issue Dec 5, 2024
Closes #47156.

All of the tests suffering from issues dialing hosts and failing
with a `failed to dial target host` error were incorrectly waiting
for nodes to become visible before establishing connections. The
main culprit for most of the failures was `waitForNodesToRegister`,
though a few tests had a very similar hand-rolled variant, which
incorrectly returned when the nodes appeared in Auth. However, since
the Proxy is the one performing dialing, they should have waited
for the nodes to appear in the Proxy.

To resolve, `waitForNodesToRegister` and all hand-rolled equivalents
have been removed in favor of `helpers.WaitForNodeCount`, which
correctly uses the `CachingAccessPoint` of the RemoteSite instead
of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated
to validate that the node watcher used for routing in the Proxy
also contained the expected number of nodes.
rosstimothy added a commit that referenced this issue Dec 6, 2024
Closes #47156.

All of the tests suffering from issues dialing hosts and failing
with a `failed to dial target host` error were incorrectly waiting
for nodes to become visible before establishing connections. The
main culprit for most of the failures was `waitForNodesToRegister`,
though a few tests had a very similar hand-rolled variant, which
incorrectly returned when the nodes appeared in Auth. However, since
the Proxy is the one performing dialing, they should have waited
for the nodes to appear in the Proxy.

To resolve, `waitForNodesToRegister` and all hand-rolled equivalents
have been removed in favor of `(TeleInstance) WaitForNodeCount`, which
correctly uses the `CachingAccessPoint` of the RemoteSite instead
of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated
to validate that the node watcher used for routing in the Proxy
also contained the expected number of nodes.