E2E tests: connect to AKS using a Wireguard tunnel #4962
I want to share some progress and caveats. I managed to run both tests in parallel successfully: the Linux AKS tests took ~18 minutes and the Windows test took ~28 minutes.
Bringing some bad news: I asked the Tailscale support team whether there is a way to rotate the Auth Keys (used to connect to the network) using an API Key (used to manage the console), and they answered that there isn't. There has been an open issue about this since 2 Nov 2021 (tailscale/tailscale#3243), so it seems we should look for alternative solutions. For me, this is a blocker. I'll come back here with news as soon as I find something.
This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.
In what area(s)?
/area test-and-release
Describe the proposal
One of the leading causes of E2E test failures (on AKS) continues to be network timeouts. These are commonly reported in the logs with errors such as "context deadline exceeded".
The issue, as confirmed by the Actions team, is that we are making too many outbound connections from the Actions runner to the Internet (especially, to the same few target IPs), in a short amount of time, and we are frequently exhausting our SNAT ports in the Actions runners (because they are behind a load balancer themselves).
Over the last few months, we have made some improvements so that our E2E test apps try to re-use TCP sockets as much as possible, thus limiting the number of outbound connections. Nevertheless, per the TCP dumps the Actions team analyzed, we are still making thousands of outbound connections, and there's not much more we can do to optimize on our end.
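To illustrate why socket re-use matters here, the following is a minimal sketch, not taken from the actual E2E test apps: it starts a throwaway local HTTP server and counts how many TCP connections the server accepts when a client opens a new connection per request versus re-using a single keep-alive connection. The names `CountingHandler` and `count_connections` are hypothetical, and Python's standard library is used purely for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that counts accepted TCP connections."""

    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections alive by default
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def count_connections(requests, reuse):
    """Send `requests` GETs and return how many TCP connections were opened."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        if reuse:
            # One socket shared by every request.
            conn = http.client.HTTPConnection(host, port)
            for _ in range(requests):
                conn.request("GET", "/")
                conn.getresponse().read()
            conn.close()
        else:
            # A fresh socket (and, behind SNAT, a fresh source port) per request.
            for _ in range(requests):
                conn = http.client.HTTPConnection(host, port)
                conn.request("GET", "/")
                conn.getresponse().read()
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections
```

Calling `count_connections(20, reuse=False)` and `count_connections(20, reuse=True)` shows the difference in the number of connections opened for the same number of requests. Each outbound connection from a runner behind a load balancer consumes an SNAT port, which is why fewer connections helps.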
A few months ago (before we root-caused the issue to SNAT port exhaustion), I wrote a proposal for addressing the problem by moving the test controller (i.e. the code that is executed when running `make e2e-test-all`) into the Kubernetes cluster itself. I do believe this would solve the issue; however, it does require some significant re-architecting of the E2E test infrastructure. While running tests from within the AKS cluster is not impossible, coordinating the work with the Actions runner so that logs and errors are correctly reported requires careful consideration.
An alternative approach, which I also floated with the Actions team in the issue linked above, could be to use a tunnel, for example a Wireguard one, to maintain a connection between the Actions runner and the AKS cluster. In fact, if the problem is that we are making too many outbound connections, using a tunnel would give us a grand total of... 1 connection only!
Wireguard seems to be a solid protocol for implementing a tunnel because it can run in userspace and can thus be executed directly inside the AKS cluster, so no new infrastructure is needed. It also allows accessing all services in the cluster by connecting to their local IPs (bonus: tests would be faster because we don't need to wait for a service to get a LoadBalancer endpoint!). An example is here: https://github.com/ivanmorenoj/k8s-wireguard
The idea would be:
Bonus: Can we use Tailscale?
There's also a bonus part to this, if possible. The link above uses "pure Wireguard" which needs to be exposed on (and accessible through) a public IP. An alternative approach could see us leverage a solution like Tailscale which would support NAT traversal.
This would be very, very nice for us, as it would not require the AKS cluster to have any inbound port open, and thus we should be able to move our test infra back into an internal Azure subscription! (Currently we are using an external "sponsored account", which works fine but is extra overhead for us to manage. We cannot use internal subscriptions because security rules close all inbound ports on resources deployed there.) With Tailscale, both the AKS cluster and the Actions runner would be registered as ephemeral nodes.
I am less sure this is doable, but I believe it would involve something like configuring Tailscale on AKS as a "subnet router".
On the Actions side, we'd use the official Tailscale Action.