-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TestIntegrations
flakiness
#47156
Comments
@zmb3 who would be best to assign this since Alex is no longer with Teleport? |
another subtest: https://github.com/gravitational/teleport/actions/runs/11169042992/job/31048876887?pr=46747
|
|
Source: https://github.com/gravitational/teleport/actions/runs/11686293079/job/32541747209 The root cause seems to be the same as the |
The crux of these issues seems to stem from the fact that we are waiting for the nodes to be visible by auth, but not waiting for them to be propagated to the proxy cache. As a result the check to wait for nodes passes, but when the dial request gets processed by the proxy it still doesn't know about the target node. |
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy cache. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `helpers.WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy cache. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `helpers.WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `helpers.WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `helpers.WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `helpers.WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Closes #47156. All of the tests suffering from issues dialing hosts, and failing with a `failed to dial target host` error were incorrectly waiting for nodes to become visible before establishing connections. The main culprit for most of the failures was `waitForNodesToRegister`, though a few tests had a very similar hand rolled variant, which incorrectly returned when the nodes appeard in Auth. However, since the Proxy is the one performing dialing, they should have waited for the nodes to appear in the Proxy. To resolve, `waitForNodesToRegister` and all hand rolled equivalents have been removed in favor of `(TeleInstance) WaitForNodeCount` which correctly uses the `CachingAccessPoint` of the RemoteSite instead of `GetClient`. Additionally, `helpers.WaitForNodeCount` was updated to validate that the node watcher used for routing in the Proxy also contained the expected number of nodes.
Failure
Link(s) to logs
Relevant snippet
I'm making this issue to cover multiple subtests of
TestIntegrations
that are all flaky for the same inventory error reason:TestIntegrations/PAM/Disabled
flakiness #46730TestIntegrations/Disconnection/client_idle_timeout_node_recoding
flakiness #39515TestIntegrations/Disconnection/expired_cert_node_recording
flakiness #39459TestIntegrations/DataTransfer
flakiness #33519TestIntegrations/Disconnection/concurrent_connection_limits_exceeded_node_recording
flakiness #31722TestIntegrations/Disconnection/client_idle_timeout_proxy_recording
flakiness #24358TestIntegrations/MultiplexingTrustedClusters
flakiness #48823TestIntegrations/ProxyHostKeyCheck
(this issue)TestIntegrations/TrustedClusters
(also this issue)The text was updated successfully, but these errors were encountered: