
E2E tests: connect to AKS using a Wireguard tunnel #4962

Open
ItalyPaleAle opened this issue Aug 2, 2022 · 5 comments

@ItalyPaleAle
Contributor

In what area(s)?

/area test-and-release

Describe the proposal

One of the leading causes of E2E test failures (on AKS) continues to be network timeouts. These are commonly reported in the logs with errors such as "context deadline exceeded".

The issue, as confirmed by the Actions team, is that we are making too many outbound connections from the Actions runner to the Internet (especially to the same few target IPs) in a short amount of time, and we are frequently exhausting the SNAT ports on the Actions runners (which sit behind a load balancer themselves).

Over the last few months, we have made some improvements so that our E2E test apps reuse TCP sockets as much as possible, thus limiting the number of outbound connections. Nevertheless, per the TCP dumps the Actions team analyzed, we are still making thousands of outbound connections, and there's not much more we can do to optimize on our end.

A few months ago (before we root-caused the issue to SNAT port exhaustion), I wrote a proposal for addressing the problem by moving the test controller (i.e. the code that is executed when running make e2e-test-all) into the Kubernetes cluster itself. I do believe this would solve the issue; however, it does require some significant re-architecting of the E2E test infrastructure. While running tests from within the AKS cluster is not impossible, coordinating the work with the Actions runner so that logs and errors are correctly reported requires careful consideration.

An alternative approach, which I also floated with the Actions team in the issue linked above, could be to use a tunnel, for example a Wireguard one, to maintain a connection between the Actions runner and the AKS cluster. In fact, if the problem is that we are making too many outbound connections, using a tunnel would bring us down to a grand total of… 1 connection only!

Wireguard seems to be a solid protocol for implementing the tunnel: it can run in userspace, so it can be executed directly inside the AKS cluster with no new infrastructure, and it allows accessing all services in the cluster by connecting to their cluster-local IPs (bonus: tests would be faster because we wouldn't need to wait for each service to get a LoadBalancer endpoint!). An example is here: https://github.com/ivanmorenoj/k8s-wireguard

The idea would be:

  • Change the Actions runner to create a Wireguard tunnel with the AKS cluster
    • This requires deploying a Wireguard server as K8s workload before starting the tests
  • The test controller already has a way to get private IPs: that's what we do when running E2E tests on KIND, where we get the private IP of a K8s service rather than the external one from the load balancer. We need to reuse that logic (possibly with some adaptations) so the Actions runner can connect to services using private IPs too (e.g. something in the 10.0.0.0/8 range)
  • Tests should then be executed as usual, but with the proper routing rules in place, all traffic from the Actions runner to AKS would go through Wireguard.
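To make the first step above concrete, a hypothetical wg-quick style config for the runner side could look like this; all keys, addresses, and the port are placeholders, and the AllowedIPs range would be the cluster's actual private CIDR:

```ini
# Hypothetical config for the Actions runner (illustrative values only).
[Interface]
PrivateKey = <runner-private-key>
Address = 192.168.99.2/32

[Peer]
# The Wireguard server deployed as a K8s workload in the AKS cluster.
PublicKey = <cluster-server-public-key>
Endpoint = <aks-public-ip>:51820
# Route the cluster's private range through the tunnel.
AllowedIPs = 10.0.0.0/8
PersistentKeepalive = 25
```

With this in place, the "routing rules" step is handled by wg-quick itself: AllowedIPs installs the route that sends cluster-bound traffic through the single tunnel connection.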

Bonus: Can we use Tailscale?

There's also a bonus part to this, if possible. The link above uses "pure Wireguard", which needs to be exposed on (and accessible through) a public IP. An alternative approach could see us leverage a solution like Tailscale, which supports NAT traversal.

This would be very, very nice for us, as it would not require the AKS cluster to have any inbound port open, so we should be able to move our test infra back into an internal Azure subscription. (Currently we are using an external "sponsored account", which works fine but is extra overhead for us to manage; we cannot use internal subscriptions because security rules close all inbound ports on resources deployed there.) With Tailscale, both the AKS cluster and the Actions runner would be registered as ephemeral nodes.

I am less sure this part is doable, but I believe it would involve something like configuring Tailscale on AKS as a "subnet router". On the Actions side, we'd use the official Tailscale Action.
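Roughly, the subnet-router side might look like the following sketch (the tailscale CLI flags are from Tailscale's docs; the CIDR and key are placeholders):

```
# Inside the AKS cluster: join the tailnet and advertise the cluster's
# private CIDR (illustrative) so the Actions runner can route to it.
tailscale up --authkey=<ephemeral-auth-key> --advertise-routes=10.0.0.0/16

# The advertised route then needs approval in the Tailscale admin console
# (or automatically, via autoApprovers in the tailnet ACL).
```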

@mcandeia
Contributor

mcandeia commented Aug 8, 2022

/assign

@mcandeia
Contributor

I want to share some progress and caveats.

I managed to execute both test suites in parallel successfully: the Linux AKS tests took ~18 minutes and the Windows tests took ~28 minutes.

  • With Tailscale in place we must have a shared (possibly team) account, and we should document how to access/generate keys and how to rotate them.
  • Tailscale is not free, and we need a subnet router for each AKS cluster running in parallel; the free plan allows only 1 active subnet router (however, I was able to run at least 3 without any further problems). Is a paid subscription an option?
  • Auth keys expire in 1 month and there is no way to extend their expiration date (the documentation says 90 days, which diverges from the UI), so we need an automatic way to rotate them and update the repository secrets.
  • Subnet CIDRs are restricted. Since Tailscale's routing model creates a virtual network between connected devices, subnets should not share the same CIDR, otherwise IP ranges conflict. Currently, our CIDRs are fixed/randomly generated; one solution is to have a pool of CIDRs and pick one that's not being used.

@mcandeia
Contributor

mcandeia commented Aug 15, 2022

Bringing some bad news:

I asked the Tailscale support team if there is a way to rotate the auth keys (used to connect to the network) using an API key (used to manage the console), and they answered that there isn't.

There's been an open issue about this since 2 Nov 2021 (tailscale/tailscale#3243), so it seems we should look for alternative solutions.

This is a blocker for us.

I'll come back here with news as soon as I find something.

@dapr-bot
Collaborator

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale Issues and PRs without response label Oct 14, 2022
@ItalyPaleAle
Contributor Author

👋🤖

@dapr-bot dapr-bot removed the stale Issues and PRs without response label Oct 14, 2022