feat: Support for Robot Servers (Syself Fork) #523

Closed
4 tasks done
apricote opened this issue Oct 4, 2023 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@apricote
Member

apricote commented Oct 4, 2023

Summary

We intend to merge the changes from syself/hetzner-cloud-controller-manager to provide support for Hetzner Robot (Dedicated/Bare Metal) servers in addition to cloud servers.

This issue will track the various tasks necessary.

Subtasks

Design Doc

We have written an internal Design Doc to figure out what exactly this means and how we want to tackle the different aspects. For transparency you can read it by expanding the section below.

Design Doc

Support Robot Servers in HCCM

Motivation

hcloud-cloud-controller-manager (HCCM) is used to expose certain Cloud functionality to Kubernetes Clusters. This includes Node Metadata, Network Routes & Load Balancers.

Right now, HCCM only supports the Hetzner Cloud API & cloud servers. Many customers have hybrid clusters running on Hetzner Dedicated (Robot) & Cloud servers. The library underlying HCCM, kubernetes/cloud-provider, only works if all nodes in the cluster can be managed by a single implementation. Effectively, this means that clusters running a current version of HCCM remove any non-Cloud Nodes from the Kubernetes cluster.

The Hetzner Cloud API already works with Robot in a few scenarios:

  • Robot servers can be added as Load Balancer Targets
  • Robot servers can be attached to private Networks through vSwitches

In addition, we can use the Robot API to:

  • Report Node Metadata

We have had a number of requests for this feature through GitHub.

There already exists a fork of HCCM by Syself that implements Robot support. They have offered to "donate" this to us, so we can use it as a starting point.

Implementation

Robot Client

The Robot team provides a REST API: https://robot.hetzner.com/doc/webservice/en.html#preface

The Robot team does not publish their own Go API Client. There is an open source client that is partially maintained by Syself. It is currently used in the Syself HCCM fork. As there is no better option, we will continue to use it.

We need to create and inject the client in our application similar to the hcloud-go client. This needs to be optional to not break existing setups.
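
The optional injection described above could look roughly like the following sketch. It is not the actual HCCM code: the `RobotClient` interface, the `Config` and `Provider` types, and the `ROBOT_USER` / `ROBOT_PASSWORD` variable names are assumptions made for illustration, and `staticRobot` is a placeholder instead of the real hrobot-go client.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// RobotClient is a hypothetical stand-in for the part of the Robot client
// that HCCM would need. It is not the real hrobot-go interface.
type RobotClient interface {
	ServerName(id int) (string, error)
}

// staticRobot is a placeholder implementation used only in this sketch.
type staticRobot struct{}

func (staticRobot) ServerName(id int) (string, error) {
	return fmt.Sprintf("robot-%d", id), nil
}

// Config holds the credentials. The Robot fields are optional (assumption).
type Config struct {
	HCloudToken   string
	RobotUser     string
	RobotPassword string
}

// RobotEnabled reports whether both Robot credentials were provided.
func (c Config) RobotEnabled() bool {
	return c.RobotUser != "" && c.RobotPassword != ""
}

// Validate rejects a missing Cloud token and half-configured Robot credentials.
func (c Config) Validate() error {
	if c.HCloudToken == "" {
		return errors.New("hcloud token is required")
	}
	if (c.RobotUser == "") != (c.RobotPassword == "") {
		return errors.New("robot user and password must be set together")
	}
	return nil
}

// Provider keeps the Robot client nil when Robot support is disabled, so
// cloud-only setups behave exactly as before.
type Provider struct {
	robot RobotClient
}

// NewProvider builds the provider and only creates a Robot client on demand.
func NewProvider(cfg Config, newRobot func(user, password string) RobotClient) (*Provider, error) {
	if err := cfg.Validate(); err != nil {
		return nil, err
	}
	p := &Provider{}
	if cfg.RobotEnabled() {
		p.robot = newRobot(cfg.RobotUser, cfg.RobotPassword)
	}
	return p, nil
}

func main() {
	cfg := Config{
		HCloudToken:   os.Getenv("HCLOUD_TOKEN"),
		RobotUser:     os.Getenv("ROBOT_USER"),     // assumed variable name
		RobotPassword: os.Getenv("ROBOT_PASSWORD"), // assumed variable name
	}
	p, err := NewProvider(cfg, func(_, _ string) RobotClient { return staticRobot{} })
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Println("robot support enabled:", p.robot != nil)
}
```

Keeping the client `nil` unless both credentials are present is one way to make the feature opt-in; callers then check for `nil` before touching any Robot code path.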

Testing

We will need to set up a new CI workflow to verify the Robot support with a dedicated server. We might need to add some additional test cases to validate that it works across the different servers.

This test needs to be optional, as many people use HCCM in a cloud-only config.

To make sure that only one pipeline is using the dedicated server at a time, we will use the GitHub Actions concurrency feature. This means that we can only run this test from GitHub Actions and not from our internal GitLab.

The server will be bootstrapped using installimage with autosetup. The node is then joined to our existing cluster using k3sup, same as the Cloud servers.

Cloud Provider Controllers
Node Controller / InstancesV2

This controller adopts the instance initially and makes the connection to our APIs (ProviderID).

It also returns metadata info about the node.

We always need to know which nodes belong to which "source". We can save this info to the ProviderID field. Our existing Cloud servers use the pattern hcloud://<SERVER-ID>. For Robot, we will use hrobot://<SERVER-ID>. This differs from the Syself fork, which uses hcloud://bm-<ROBOT-ID>. We will also allow reading the Syself format, to enable users to migrate from the fork to our HCCM.
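
As a minimal sketch of how the three ProviderID formats above could be told apart: the function and type names below are hypothetical and not taken from the HCCM code base; only the three prefixes come from the design doc.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Source identifies which API a node belongs to (hypothetical type).
type Source string

const (
	SourceCloud Source = "cloud"
	SourceRobot Source = "robot"
)

const (
	prefixCloud       = "hcloud://"
	prefixRobot       = "hrobot://"
	prefixRobotLegacy = "hcloud://bm-" // format used by the Syself fork
)

// ParseProviderID returns the source and numeric server ID of a ProviderID.
func ParseProviderID(providerID string) (Source, int64, error) {
	var (
		src Source
		raw string
	)
	switch {
	// The legacy prefix must be checked first, because it also starts
	// with the Cloud prefix.
	case strings.HasPrefix(providerID, prefixRobotLegacy):
		src, raw = SourceRobot, strings.TrimPrefix(providerID, prefixRobotLegacy)
	case strings.HasPrefix(providerID, prefixRobot):
		src, raw = SourceRobot, strings.TrimPrefix(providerID, prefixRobot)
	case strings.HasPrefix(providerID, prefixCloud):
		src, raw = SourceCloud, strings.TrimPrefix(providerID, prefixCloud)
	default:
		return "", 0, fmt.Errorf("unknown provider ID format: %q", providerID)
	}

	id, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("invalid server ID in %q: %w", providerID, err)
	}
	if id <= 0 {
		return "", 0, fmt.Errorf("server ID in %q must be positive", providerID)
	}
	return src, id, nil
}

// FormatProviderID always writes the new format, never the legacy one.
func FormatProviderID(src Source, id int64) string {
	if src == SourceRobot {
		return prefixRobot + strconv.FormatInt(id, 10)
	}
	return prefixCloud + strconv.FormatInt(id, 10)
}

func main() {
	for _, in := range []string{"hcloud://123", "hrobot://456", "hcloud://bm-789", "aws://1"} {
		src, id, err := ParseProviderID(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> source=%s id=%d\n", in, src, id)
	}
}
```

Reading the legacy format but only ever writing `hrobot://` would let clusters coming from the fork keep working while new nodes get the new format.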

These fields from the InstancesV2 interface have restrictions for Robot servers:

  • Shutdown status is not available for server types using a tower case. This mostly affects older models.
  • Node Addresses are only partially available through the API. Only the Public IP (or at least one of them) is definitely set in the API. Any private IPs in vSwitches are not visible, and the hostname is also not available.
  • The Zone (Datacenter Name) is lower-cased in the Hetzner Cloud API and upper-cased in the Hetzner Robot API. We should normalize this to lower case.
  • The Region (Location Name) needs to be parsed from the Zone (Datacenter Name)
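
The last two points could be handled along the lines of the sketch below. It assumes that datacenter names follow the pattern `<location>-dc<number>` (for example `FSN1-DC14`); that pattern and the function names are assumptions for illustration, not something the Robot API guarantees.

```go
package main

import (
	"fmt"
	"strings"
)

// NormalizeZone lower-cases a datacenter name so that values from the
// Robot API compare equal to values from the Cloud API.
func NormalizeZone(datacenter string) string {
	return strings.ToLower(strings.TrimSpace(datacenter))
}

// RegionFromZone derives the location name from a datacenter name.
// Assumption: the location is everything before the first "-".
func RegionFromZone(datacenter string) (string, error) {
	location, _, found := strings.Cut(NormalizeZone(datacenter), "-")
	if !found || location == "" {
		return "", fmt.Errorf("cannot derive region from datacenter %q", datacenter)
	}
	return location, nil
}

func main() {
	for _, dc := range []string{"FSN1-DC14", "nbg1-dc3", "unknown"} {
		region, err := RegionFromZone(dc)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("zone=%s region=%s\n", NormalizeZone(dc), region)
	}
}
```

Returning an error for names that do not match the assumed pattern keeps an unexpected datacenter name from silently producing a wrong region label.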

If Robot support is not enabled and we encounter a Node that we cannot associate with any Cloud server, we should log a warning. This warning should inform the user that the Node was removed and that, if they are trying to add Robot servers to their cluster, they should enable Robot support.

Route Controller / Routes

Using the native routing feature should be possible since the launch of expose_routes_to_vswitch. This can be hard to implement and especially hard to verify. We will not include support for this in the first release; depending on customer demand, we might introduce it at a later time.

Service Controller / LoadBalancer

We can add the Robot servers to the load balancer target list through their IP. We can get the IP from the Node object.
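
A rough sketch of picking that IP from a Node's address list follows. The `NodeAddress` types are local copies that mirror the shape of the Kubernetes core/v1 types so the example stays self-contained, and the preference order (ExternalIP before InternalIP) is an assumption, not a decision taken in this design doc.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// NodeAddressType and NodeAddress mirror the Kubernetes core/v1 types.
// They are redefined here only to keep the sketch free of dependencies.
type NodeAddressType string

const (
	NodeHostName   NodeAddressType = "Hostname"
	NodeExternalIP NodeAddressType = "ExternalIP"
	NodeInternalIP NodeAddressType = "InternalIP"
)

type NodeAddress struct {
	Type    NodeAddressType
	Address string
}

// LoadBalancerTargetIP returns the first valid IP of the preferred type.
// Assumption: the public IP is preferred, because it is the only address
// the Robot API is known to report.
func LoadBalancerTargetIP(addrs []NodeAddress) (netip.Addr, error) {
	for _, want := range []NodeAddressType{NodeExternalIP, NodeInternalIP} {
		for _, a := range addrs {
			if a.Type != want {
				continue
			}
			ip, err := netip.ParseAddr(a.Address)
			if err != nil {
				continue // skip malformed entries instead of failing
			}
			return ip, nil
		}
	}
	return netip.Addr{}, errors.New("node has no usable IP address")
}

func main() {
	addrs := []NodeAddress{
		{Type: NodeHostName, Address: "bm-node-0"},
		{Type: NodeExternalIP, Address: "203.0.113.10"},
	}
	ip, err := LoadBalancerTargetIP(addrs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("load balancer IP target:", ip)
}
```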

Documentation

This is a major new feature that needs to be thoroughly documented:

  • How to configure hccm with Robot
  • What are the requirements?
    • To find the Robot server matching a K8s node: does the name of the Node need to match the server name in Robot? Alternatively, users would need to set the providerID themselves
  • Which features are implemented? How do they work?

Alternatives

No support

We can just decide not to support Robot servers. This sucks from a customer's perspective because they do not care that Robot & Cloud are different teams/companies. Our Cloud APIs already integrate with Robot on some accounts, so this should also be supported in our integrations.

Forking the hrobot-go client

If we do this in an official manner, customers assume that we are responsible for maintenance and might demand fixes/features. This is out of scope for our team, and might be better owned by Robot. They have no (official) interest in this at this time.

If we encounter issues with hrobot-go, we can still fork it and use a replace directive in go.mod to quickly release fixes.

@apricote apricote added the enhancement New feature or request label Oct 4, 2023
@apricote apricote self-assigned this Oct 4, 2023
@maaft

maaft commented Oct 4, 2023

great news! <3

@pservit

pservit commented Oct 5, 2023

@apricote
It would be nice to have the ability to set the internal IP (Robot vSwitch / VLAN) for bare-metal nodes.
Right now we use the Syself fork with a dirty hack to get the internal IP from the Robot server name (e.g. "bm-node-0/10.3.1.2").

@apricote
Member Author

apricote commented Oct 6, 2023

@apricote It would be nice to have the ability to set the internal IP (Robot vSwitch / VLAN) for bare-metal nodes. Right now we use the Syself fork with a dirty hack to get the internal IP from the Robot server name (e.g. "bm-node-0/10.3.1.2").

As mentioned in the Design Doc, this data is not available in the API, and it's not possible for us to provide it without deploying a DaemonSet that reads the info from the nodes:

Any private IPs in vSwitches are not visible, and the hostname is also not available.

apricote added a commit that referenced this issue Oct 6, 2023
This utility function was duplicated with nearly the exact same
functionality. This commit cleans it up by extracting to a new package
(to avoid cyclic imports).

These two methods are about to get more complicated with #523, better to
clean it up now than to make changes to both locations in the future.

---------

Co-authored-by: Jonas L. <[email protected]>
apricote added a commit that referenced this issue Nov 21, 2023
Based on the Fork by Syself[0] and the Design Doc[1].

[0] https://github.com/syself/hetzner-cloud-controller-manager
[1] #523 (comment)

This ports most features of the fork while refactoring them to match
our coding style and the improvements I made in preparation for this.

Closes #525 #526 #527

---------

Co-authored-by: janiskemper <[email protected]>
Co-authored-by: Mawe Sprenger <[email protected]>
Co-authored-by: Thomas Guettler <[email protected]>
Co-authored-by: Anurag <[email protected]>
Co-authored-by: batistein <[email protected]>
@apricote
Member Author

Basic Support is merged (#561). I will add documentation & migration guide before cutting a release with the feature.

There are also a number of features from the Syself fork that were removed in the first PR. These will also be submitted before a release.

@apricote
Member Author

apricote commented Dec 1, 2023

@jahanson

jahanson commented Dec 2, 2023

Just checking in, I wanted to say v1.19.0-rc.0 is working great for me, including the new Robot code.

image

@apricote
Member Author

apricote commented Dec 4, 2023

Forgot to provide an update here: I published a pre-release on Friday (v1.19.0-rc.0) to make it simple for anyone interested to test it.

I plan on publishing the proper release on Wednesday (2023-12-06), so if anyone finds any bugs, now is the time to open an issue :)


Great to hear it's working for you @jahanson!

@apricote
Member Author

apricote commented Dec 7, 2023

Released in 1.19.0 🚀


4 participants