
RFC: auto-generation of TLS certificates for new clusters and newly added nodes #51991

Open
wants to merge 16 commits into base: master
Conversation


@aaron-crl aaron-crl commented Jul 28, 2020

Epic: CRDB-6663

This is a proposal to add functionality to improve secure cluster creation and node joining to CockroachDB. It is still very much a work in progress but I felt it would be more valuable to get discussion moving sooner than present a more complete product.

https://github.com/aaron-crl/cockroach/blob/rfc20200722_cert_free_secure_setup/docs/RFCS/20200722_certificate_free_secure_setup.md

@aaron-crl aaron-crl added the do-not-merge label (bors won't merge a PR with this label) Jul 28, 2020
@aaron-crl aaron-crl requested review from bdarnell, knz and chrisseto July 28, 2020 13:19
@aaron-crl aaron-crl requested a review from a team as a code owner July 28, 2020 13:19
@cockroach-teamcity

This change is Reviewable


@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @chrisseto, and @knz)


docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

This introduces the concept of a `join-token` that is provided to a `node` to permit it to join an existing cluster. `nodes` will self manage certificates and trust on cluster internal interfaces enabling operators to focus on configuring and securing external access.

Borrowing from the "secure" guide (https://www.cockroachlabs.com/docs/v20.1/secure-a-cluster.html) this feature will allow all four of the existing certificate management steps to fall away. In their place an additional output from the first step of "Start the Cluster" will emit the two join tokens for the second and third nodes. These join tokens would then be added to the start lines for each node and the cluster would start with all internode traffic secured by strong TLS.

This sounds great for the manual "start a cluster" process. But it seems challenging to orchestrate in kubernetes (or k8s-like environments). The problem is that in general, you want to treat all your cockroach nodes as peer tasks of a single job, but this model expects you to be able to construct all their command-line arguments up front, before any of the nodes have started.

In our k8s operator we can handle an arbitrarily-complex initialization process, but any complexity we build into the operator is something that non-k8s users will also need to build into their processes.

In this context it may actually be simpler to split out the CA functionality into a separate job, at least during the initialization phase: start a cockroach-ca process, request join tokens from it, then pass those join tokens to all the regular cockroach processes (who will then exchange their join tokens for certs from cockroach-ca). But then we've just shifted the bootstrapping problem around a bit.


docs/RFCS/20200722_certificate_free_secure_setup.md, line 32 at r1 (raw file):

This removes the _requirement_ that operators manage inter-node certificates themselves. They may still do so if they have functional or business needs that do not permit this pattern. This will also help with orchestration as any node may then be used to create a `join-token` for another node providing the same high availability as any other CockroachDB feature. The ability to create these tokens can be gated behind the same permissions as those used to add and remove nodes to existing clusters.

This removes strict external dependence on fully qualified domain names ([TBD]) for internode TLS, removing the need to have internode certificates signed by globally trusted Certificate Authorities and reducing the blast damage of a compromised CockroachDB certificate.

We already do not depend on FQDNs (unless you're using a CA that imposes such a requirement). All you need is a hostname that resolves in the context of the cluster, or even an IP address.


docs/RFCS/20200722_certificate_free_secure_setup.md, line 62 at r1 (raw file):

*I’m not sure how init works so we may have to tweak this around joining initial nodes to the base cluster before the init command is sent.*

My hypothesis is that we can use the root cert to request `join-token`s from the first node and use them to join other nodes to the first node before issuing the init command. If this presents complications it might be easier to init the first node and then grow the cluster with `join-token`s immediately after.

I don't think this works - the init command is what tells the first node that it is the first node, so the first-node process can't run until we've initialized the cluster.


@knz knz left a comment


Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, and @chrisseto)


docs/RFCS/20200722_certificate_free_secure_setup.md, line 5 at r1 (raw file):

- Start Date: 2020-07-22
- Authors: @aaron-crl
- RFC PR: (PR # after acceptance of initial draft)

You can now update this line and refer to #51991 (this PR).


docs/RFCS/20200722_certificate_free_secure_setup.md, line 16 at r1 (raw file):

Certificates and their management have traditionally been major sources of toil and confusion for system administrators and users across all sectors that depend upon them for trust. Our current user story for CockroachDB does not stand out against this tradition.

After reviewing the existing CockroachDB trust mechanisms, it seems that effort and complexity of certificate management drives users to test and prototype in `--insecure-mode` rather than the default (secure) state. Our getting started guide highlights this well as the "insecure" guide (https://www.cockroachlabs.com/docs/v20.1/start-a-local-cluster.html) starts with `cockroach start` whereas the "secure" guide (https://www.cockroachlabs.com/docs/v20.1/secure-a-cluster.html) has four complex certificate generation steps before the user may even start their cluster.

nit: the command-line flag is --insecure not --insecure-mode


docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This sounds great for the manual "start a cluster" process. But it seems challenging to orchestrate in kubernetes (or k8s-like environments). The problem is that in general, you want to treat all your cockroach nodes as peer tasks of a single job, but this model expects you to be able to construct all their command-line arguments up front, before any of the nodes have started.

In our k8s operator we can handle an arbitrarily-complex initialization process, but any complexity we build into the operator is something that non-k8s users will also need to build into their processes.

In this context it may actually be simpler to split out the CA functionality into a separate job, at least during the initialization phase: start a cockroach-ca process, request join tokens from it, then pass those join tokens to all the regular cockroach processes (who will then exchange their join tokens for certs from cockroach-ca). But then we've just shifted the bootstrapping problem around a bit.

I've thought a bit about this and there are indeed two obstacles to the current proposal:

  • it creates an asymmetry between "first node" and "other nodes" which makes orchestration cumbersome
  • it didn't take into account that our cluster bootstrapping process enables nodes to join each other before the cluster is initialized and a "first node" is selected. (This feature is desirable so I don't think we want to prevent it)

But we can still move forward with join tokens, here's how.

First we need to identify the two use cases:

  • use case 1: bring a cluster together from scratch, that's not been bootstrapped yet (and ensure the process is homogeneous for all nodes)
  • use case 2: adding/joining more nodes to an existing cluster that's already been bootstrapped

Now we can define a single mechanism that works with both:

  1. define join tokens to include their own pre-generated CA cert and private key, and sign the remainder of the join token with it
  2. make join tokens generatable by a CLI command without a cluster running already - by generating the CA cert and remainder of the join token from scratch
  3. from there, cover the use cases as follows:
    • for use case 1:
      • make each node save the provided pre-generated CA to disk, then validate peer-to-peer that they are working with the same join token by verifying the signature of the non-CA part of the token sent over to peers. At this point, they have not bootstrapped yet. This can be done by having each RPC connect verify the token signature as part of the RPC conn handshake.
      • once everyone has confirmed, proceed with bootstrap as usual.
    • for use case 2:
      • sign the CA in the join token using the CA already set up for the cluster (this can be done automatically in our CLI command if we already have a cert dir to work with; or it can be done using a regular CSR with an external CA)
      • provide the token to the node to be added. The new node presents the non-CA part of the token with CA signature to the other node(s) from the already existing cluster.
      • the other nodes (already initialized) look at the signature and check that the token's signing CA has been signed by the CA they already know about. If the signatures match, accept the join.

What do y'all think?

@aaron-crl

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @chrisseto, and @knz)

docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

This introduces the concept of a `join-token` that is provided to a `node` to permit it to join an existing cluster. `nodes` will self manage certificates and trust on cluster internal interfaces enabling operators to focus on configuring and securing external access.

Borrowing from the "secure" guide (https://www.cockroachlabs.com/docs/v20.1/secure-a-cluster.html) this feature will allow all four of the existing certificate management steps to fall away. In their place an additional output from the first step of "Start the Cluster" will emit the two join tokens for the second and third nodes. These join tokens would then be added to the start lines for each node and the cluster would start with all internode traffic secured by strong TLS.

This sounds great for the manual "start a cluster" process. But it seems challenging to orchestrate in kubernetes (or k8s-like environments). The problem is that in general, you want to treat all your cockroach nodes as peer tasks of a single job, but this model expects you to be able to construct all their command-line arguments up front, before any of the nodes have started.

I'm not fluent in kubernetes yet so I defer to those with more knowledge (@chrisseto ?). We may be able to address this with an init container that issues a start command and provision one node as the leader with a time-bounded init parameter (if I captured our discussion accurately). Certs would be written to their normal location on the container filesystem after the leader started and they could be used by an orchestration solution to request join tokens for the remaining processes.

In our k8s operator we can handle an arbitrarily-complex initialization process, but any complexity we build into the operator is something that non-k8s users will also need to build into their processes.

In this context it may actually be simpler to split out the CA functionality into a separate job, at least during the initialization phase: start a cockroach-ca process, request join tokens from it, then pass those join tokens to all the regular cockroach processes (who will then exchange their join tokens for certs from cockroach-ca). But then we've just shifted the bootstrapping problem around a bit.

docs/RFCS/20200722_certificate_free_secure_setup.md, line 32 at r1 (raw file):

This removes the _requirement_ that operators manage inter-node certificates themselves. They may still do so if they have functional or business needs that do not permit this pattern. This will also help with orchestration as any node may then be used to create a `join-token` for another node providing the same high availability as any other CockroachDB feature. The ability to create these tokens can be gated behind the same permissions as those used to add and remove nodes to existing clusters.

This removes strict external dependence on fully qualified domain names ([TBD]) for internode TLS, removing the need to have internode certificates signed by globally trusted Certificate Authorities and reducing the blast damage of a compromised CockroachDB certificate.

We already do not depend on FQDNs (unless you're using a CA that imposes such a requirement). All you need is a hostname that resolves in the context of the cluster, or even an IP address.

Thanks! Updated language to reflect this.

docs/RFCS/20200722_certificate_free_secure_setup.md, line 62 at r1 (raw file):

*I’m not sure how init works so we may have to tweak this around joining initial nodes to the base cluster before the init command is sent.*

My hypothesis is that we can use the root cert to request `join-token`s from the first node and use them to join other nodes to the first node before issuing the init command. If this presents complications it might be easier to init the first node and then grow the cluster with `join-token`s immediately after.

I don't think this works - the init command is what tells the first node that it is the first node, so the first-node process can't run until we've initialized the cluster.

Updated this section to reflect current thinking after our discussion.

@aaron-crl

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, and @chrisseto)

docs/RFCS/20200722_certificate_free_secure_setup.md, line 5 at r1 (raw file):

- Start Date: 2020-07-22
- Authors: @aaron-crl
- RFC PR: (PR # after acceptance of initial draft)

You can now update this line and refer to #51991 (this PR).

Thanks! Done.

docs/RFCS/20200722_certificate_free_secure_setup.md, line 16 at r1 (raw file):

Certificates and their management have traditionally been major sources of toil and confusion for system administrators and users across all sectors that depend upon them for trust. Our current user story for CockroachDB does not stand out against this tradition.

After reviewing the existing CockroachDB trust mechanisms, it seems that effort and complexity of certificate management drives users to test and prototype in `--insecure-mode` rather than the default (secure) state. Our getting started guide highlights this well as the "insecure" guide (https://www.cockroachlabs.com/docs/v20.1/start-a-local-cluster.html) starts with `cockroach start` whereas the "secure" guide (https://www.cockroachlabs.com/docs/v20.1/secure-a-cluster.html) has four complex certificate generation steps before the user may even start their cluster.

nit: the command-line flag is --insecure not --insecure-mode

Thanks, corrected.

docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

Previously, knz wrote…
I've thought a bit about this and there are indeed two obstacles to the current proposal:

  • it creates an asymmetry between "first node" and "other nodes" which makes orchestration cumbersome
  • it didn't take into account that our cluster bootstrapping process enables nodes to join each other before the cluster is initialized and a "first node" is selected. (This feature is desirable so I don't think we want to prevent it)

But we can still move forward with join tokens, here's how.

First we need to identify the two use cases:

  • use case 1: bring a cluster together from scratch, that's not been bootstrapped yet (and ensure the process is homogeneous for all nodes)
  • use case 2: adding/joining more nodes to an existing cluster that's already been bootstrapped

Now we can define a single mechanism that works with both:

  1. define join tokens to include their own pre-generated CA cert and private key, and sign the remainder of the join token with it

  2. make join tokens generatable by a CLI command without a cluster running already - by generating the CA cert and remainder of the join token from scratch

  3. from there, cover the use cases as follows:

    • for use case 1:

      • make each node save the provided pre-generated CA to disk, then validate peer-to-peer that they are working with the same join token by verifying the signature of the non-CA part of the token sent over to peers. At this point, they have not bootstrapped yet. This can be done by having each RPC connect verify the token signature as part of the RPC conn handshake.
      • once everyone has confirmed, proceed with bootstrap as usual.
    • for use case 2:

      • sign the CA in the join token using the CA already set up for the cluster (this can be done automatically in our CLI command if we already have a cert dir to work with; or it can be done using a regular CSR with an external CA)
      • provide the token to the node to be added. The new node presents the non-CA part of the token with CA signature to the other node(s) from the already existing cluster.
      • the other nodes (already initialized) look at the signature and check that the token's signing CA has been signed by the CA they already know about. If the signatures match, accept the join.

What do y'all think?

I'm disinclined to include the node cert private keys in the join token for two reasons.

  • I'd like to keep the size of join tokens smaller for ease of use outside of automation
  • I want to avoid leaking any persistent cryptographic primitives outside of the cluster. Single-use join tokens are consumed upon use (and have a wall-clock expiration), reducing the risk associated with them spilling. However, if we pack the node cert into the join token and the token is later leaked, an attacker potentially has the means to authenticate as a node to the cluster.

I updated the doc to include an explicit token consumption step.

@tbg tbg requested review from bdarnell and knz August 4, 2020 13:23

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @chrisseto, and @knz)


docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

Previously, aaron-crl wrote…

I'm disinclined to include the node cert private keys in the join token for two reasons.

  • I'd like to keep the size of join tokens smaller for ease of use outside of automation
  • I want to avoid leaking any persistent cryptographic primitives outside of the cluster. Single-use join tokens are consumed upon use (and have a wall-clock expiration), reducing the risk associated with them spilling. However, if we pack the node cert into the join token and the token is later leaked, an attacker potentially has the means to authenticate as a node to the cluster.

I updated the doc to include an explicit token consumption step.

I agree with Raphael that fundamentally there are two operations: cluster creation and scaling. More to the point, cluster creation fundamentally always creates a single-node cluster. So we really have (mod caveat below):

  1. single-node cluster creation
  2. adding nodes to a cluster

Step 1 is already a manual (or scripted) operator action, via the ./cockroach init command. This command can in addition initialize an internal CA.

That leaves the question of how nodes will be allowed to join this single-node cluster. I think the way it ought to work is that there is a command that you run against the existing cluster to create a single-use join token, which is then passed to the new node. Someone from CC should chime in, but that seems perfectly reasonable from an operator perspective. Note that I don't want to require that the new node contacts the node that issued the join token in particular. As long as it contacts any of them, things should work out; the cluster should be able to verify the join token even if the issuer went down (or at least route internally, but why not put it in a replicated table?).

What's probably awkward in practice is tying the join token to a hostname. That's all fine if you're deploying CRDB as a stateful set (so you can predict the hostname), but what if you don't? You want to spin up the pod only when you have the join token, but then you may not learn the host name early enough. I wonder if we can drop that requirement altogether and treat the join token like a private key - if you leak a private key while provisioning it to the target node, then you lose. Perhaps some amount of protection can be retained, by attaching some wildcard to the join token that must be satisfied? You don't have to predict the exact host name, but you can limit who can use the token however tightly your deployment allows.

re: the caveat — today the process of creating a cluster may seem as though it bootstraps it with as many nodes as you would like. In particular, the join flags may form a complicated graph that (if sufficiently tightly connected to the node the cluster will init on) will allow all nodes to join. For example you could have a fresh cluster with join flags:

n3 joins n2
n2 joins n4
n4 joins n1
n1 gets inited

In this case, when all nodes are started and then n1 is inited, n4 would first manage to join the cluster via n1, then n2 would be able to join it via n4, then n3 via n2. This would simply not work in the same way with tokens since you could not start n2-n4 without a join token, and you would not be able to get one until n1 is init'ed. However - I've always felt that this flexibility was a bad idea as it's really hard to explain when it works and when it doesn't (even when the join flags form a connected graph, it's not necessarily going to resolve, due to the sparsity of where gossip actually connects to).
I would in principle support a smaller number of ways in which clusters can be set up and scaled if that means that we can simplify and streamline the deployments, which it seems would be the result here: you start a cluster as a single node cluster, and there's a defined process to adding nodes, and that's it.


@chrisseto chrisseto left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)


docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

Someone from CC should chime in, but that seems perfectly reasonable from an operator perspective.

Yep! We could pretty easily have an init container that mints the token then writes to a temp file, or just have a subshell in the crdb container: `cockroach start --join-token=$(cockroach get-join-token ...)`

What's probably awkward in practice is tying the join token to a hostname.

From kubernetes in general this isn't too difficult. Any pod or init container can run hostname.
That being said, hostnames are not necessarily unique per crdb node.

Consider the situation where one has a statefulset with 3 replicas.
Replicas is scaled up to 4 then back to 3.
The PV/PVC for replica 4 is removed.
Replicas is then scaled up to 4 again.

crdb-3.cluster.local has now represented 2 different crdb nodes.
They can't be present in the system at the same time so it may not matter? but baking hostnames into the join token may lead to a false understanding of the importance of hostnames in a cockroach cluster?

@aaron-crl

I'm happy to abandon the hostname constraint as long as we leave the other means of server authentication in the token. The intent for including it was to make it difficult for an attacker to capture a join token for replay while pretending to be the target node. I expected that the join tokens would be replicated across the cluster regardless so this makes more sense in that regard too.

I'll update the RFC to reflect this and include some of the attack patterns we hope to prevent later today.


@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @knz)


docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

From kubernetes in general this isn't too difficult. Any pod or init container can run hostname.

Would we be able to obtain the join flag from the new node though? I think you need certs to do that, and that's what we're trying to get. I was thinking that we would execute this from either a one-off pod or an existing node, before you've even started adding the new node (hence my concern about the hostname).
Taking a step back, isn't, for a k8s-backed deployment, the "canonical" way to "just" pass the CA secret key to each node? That way they can mint their own cert before they join - and it doesn't look like we're regressing on security, since after the join token the node also gets handed that very same root cert

@aaron-crl

Taking a step back, isn't, for a k8s-backed deployment, the "canonical" way to "just" pass the CA secret key to each node? That way they can mint their own cert before they join - and it doesn't look like we're regressing on security, since after the join token the node also gets handed that very same root cert

After considering this, I like this a lot. Let the operator generate the initial certs for sure (we would know those constraints). I don't know enough about how we add or remove nodes to k8s configurations to speak to the add/remove operation but this sounds cleaner to me than using the tokens for initial k8s starts.

@aaron-crl

Counterpoint to the above: this would require us to keep two certificate generation paths in sync if we ever made changes to the way we produce certificates for inter-node use.

Stubbed out region for k8s notes
@aaron-crl

I removed hostname from the join-token and fleshed out the validation steps a little. Will need to chat with folks more about the k8s constraints to feel comfortable addressing those areas.

@bdarnell

bdarnell commented Aug 5, 2020

Taking a step back, isn't, for a k8s-backed deployment, the "canonical" way to "just" pass the CA secret key to each node? That way they can mint their own cert before they join - and it doesn't look like we're regressing on security, since after the join token the node also gets handed that very same root cert

So essentially what we have today, but the cockroach nodes know how to generate all their certs from a CA key if that's what they're given? That does remove some steps, but by leaving the CA front-and-center, I'm not sure it's meeting all of the goals of this proposal (including ones that @aaron-crl and I have discussed that didn't seem to make it into the doc).

I'm not sure that there is a "canonical" way to do this kind of thing in k8s - that's why we've struggled to define our own patterns. To the extent there is a canonical pattern, it probably involves a single shared symmetric key instead of all this complexity with CAs. The CA/cert model gives you the potential to do various more-secure things like storing the CA private key in an HSM, although I don't know if anyone is successfully realizing that level of security with CRDB today.

@knz

knz commented Aug 25, 2020

@bobvawter and I chatted about this recently. There is a dimension that is not yet clearly covered by the RFC, that of distinguishing the following scenarios:

  1. adding a new region to a k8s deployment. In that case, the k8s networks are disjoint and there is a need to validate the trust of the cross-k8s node-node connections. Then the proposal of this RFC applies as-is: we populate the shared secret in the new region's k8s configuration to bootstrap the first node(s) there.

    One point here that would benefit from a callout in the RFC: the importance of an expiration deadline.

    The secret will be set up in the k8s config but is likely to remain there, even after the new region has been bootstrapped (operators typically don't go back to change a configuration "that works"). A stale secret getting in the wrong hands should not enable adding nodes elsewhere.

  2. adding a new node onto a dedicated k8s cluster, and in general adding nodes to local privileged networks.

    In this case, from a UX perspective even requiring a shared secret token is cumbersome and unnecessary. If the network is privileged, only the operators can run processes on it.

    It should be possible to instruct nodes to accept new joins (and cert exchange) from the same local network.

    This would be configured as a "trusted network prefix/mask" on which no secret is needed to accept the cert exchange handshake. Protection against operator mistakes can be achieved using --cluster-name instead of the join token.

    This feature is necessary to enable seamless (and hassle-free) scale-up of local k8s clusters and other enterprise deployments on trusted networks.


@rafiss rafiss left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, and @knz)


docs/RFCS/20200722_certificate_free_secure_setup.md, line 26 at r1 (raw file):

# Guide-level explanation

This introduces the concept of a `join-token` that is provided to a `node` to permit it to join an existing cluster. `nodes` will self manage certificates and trust on cluster internal interfaces enabling operators to focus on configuring and securing external access.

This doc currently only mentions the operator experience. Perhaps it's implicit in the design, but I'd also like to at least see a mention of the developer/application/client experience. What does the connection string look like now? How do applications and ORMs connect to this default-secure database?

Relatedly: what needs to change in all of our own existing tests that use insecure mode?


docs/RFCS/20200722_certificate_free_secure_setup.md, line 34 at r1 (raw file):

with minor adjustment to their workflow.

Similar to my comment above, I would like to be more specific about which adjustments to the workflow are required to test in this new state (as compared to insecure mode).


docs/RFCS/20200722_certificate_free_secure_setup.md, line 75 at r1 (raw file):

Success will present itself through:
- A decrease in the percentage of clusters running in insecure mode
- Reduce operator toil and friction easing adoption of the product where proof of concepts may have been conducted with the --insecure-mode flag

I think some part of the success monitoring should capture the developer experience as well. This change should make it no harder for applications/clients to connect to the database, and we should reflect that in the goals/success criteria.

* Added potential solutions for initial startup goals to Guide-level explanation.
Author

@aaron-crl aaron-crl left a comment

Re: 1. The secret should be consumed during joining, rendering it useless to an attacker.

Re: 2. This makes me queasy from a security standpoint. If I'm understanding correctly, this would mean that any local network access within a k8s environment could translate directly to admin-level access to the database by presenting as a node. That seems like a pretty severe risk, which we might be better off avoiding with a good k8s operator experience.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @knz, and @rafiss)


docs/RFCS/20200722_certificate_free_secure_setup.md, line 5 at r1 (raw file):

Previously, aaron-crl wrote…

Thanks! Done.

Done.


docs/RFCS/20200722_certificate_free_secure_setup.md, line 26 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

This doc currently only mentions the operator experience. Perhaps it's implicit in the design, but I'd also like to at least see a mention of the developer/application/client experience. What does the connection string look like now? How do applications and ORMs connect to this default-secure database?

Relatedly: what needs to change in all of our own existing tests that use insecure mode?

Thank you for jumping in! These are concerns that are worth a callout in the goals section and I'm adding them accordingly. Guide level is still a work in progress as we refine how the technical approach is going to flow. I'll leave this and your other comments open for the present until they are addressed.


docs/RFCS/20200722_certificate_free_secure_setup.md, line 28 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

From kubernetes in general this isn't too difficult. Any pod or init container can run hostname.

Would we be able to obtain the join flag from the new node though? I think you need certs to do that, and that's what we're trying to get. I was thinking that we would execute this from either a one-off pod or an existing node, before you've even started adding the new node (hence my concern about the hostname).
Taking a step back: for a k8s-backed deployment, isn't the "canonical" way to "just" pass the CA secret key to each node? That way they can mint their own cert before they join, and it doesn't look like we're regressing on security, since after the join token the node also gets handed that very same root cert

I've pushed some more information into a Goals section and some initial explorations of how to handle startup, informed by these comments and conversations with @bdarnell.

Regarding the nodes being able to print their own join-token, that shouldn't work in the existing proposal because all join-tokens are checked at time of join within the database to avoid reuse.


docs/RFCS/20200722_certificate_free_secure_setup.md, line 32 at r1 (raw file):

Previously, aaron-crl wrote…

Thanks! Updated language to reflect this.

Done.


docs/RFCS/20200722_certificate_free_secure_setup.md, line 62 at r1 (raw file):

Previously, aaron-crl wrote…

Updated this section to reflect current thinking after our discussion.

Done.

itsbilal added a commit to itsbilal/cockroach that referenced this pull request Feb 23, 2021
…odes

This PR implements the init protocol phase of the cert-free setup
described in cockroachdb#51991. A lot of the code is pulled out of Aaron's
reference implementation of this protocol:
https://github.com/aaron-crl/toy-secure-init-handshake/tree/n-way-join

One part of cockroachdb#60632.

Release note: None.
craig bot pushed a commit that referenced this pull request Feb 24, 2021
60766: server: Implement auto-init handshake to negotiate CA certs between nodes r=aaron-crl a=itsbilal

This PR implements the init protocol phase of the cert-free setup
described in #51991. A lot of the code is pulled out of Aaron's
reference implementation of this protocol:
https://github.com/aaron-crl/toy-secure-init-handshake/tree/n-way-join

One part of #60632.

Ready for review, but these parts are still TODO:
- [x] Unit tests, or at least manual test steps
- [x] Address some of the TODOs, including rebasing on latest version of #60705 and only sending over the CAs in the trust bundle.

Release note: None.

Co-authored-by: Bilal Akhtar <[email protected]>
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 4, 2021
This change adds a new JoinToken struct in the server
package for use in TLS auto-join (see RFC cockroachdb#51991).

It also adds a feature flag to hide this feature except
for opt-in use, as well as a generator function to the status
server that can be used by a future SQL statement that would
generate and return join tokens. The feature flag is also
in the SQL package so it can be used by both SQL and server.

Part of cockroachdb#60632.

Release justification: All changes are to new code paths, and
are gated behind a feature flag.
Release note: None.
craig bot pushed a commit that referenced this pull request Mar 5, 2021
61505: server,sql: Add new JoinToken struct, generator, feature flag r=aaron-crl a=itsbilal

This change adds a new JoinToken struct in the server
package for use in TLS auto-join (see RFC #51991).

It also adds a feature flag to hide this feature except
for opt-in use, as well as a generator function to the status
server that can be used by a future SQL statement that would
generate and return join tokens. The feature flag is also
in the SQL package so it can be used by both SQL and server.

Part of #60632.

WIP until:
- [x] Unit tests for `server.JoinToken`

Release justification: All changes are to new code paths, and
are gated behind a feature flag.
Release note: None.

Co-authored-by: Bilal Akhtar <[email protected]>
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 10, 2021
Adds a CREATE JOINTOKEN statement for use in TLS auto-joins.
This statement, when run on a self-hosted single-tenant
Cockroach node, creates and returns a new join token. This
join token can then be copy-pasted to new nodes and used
to give them the set of certificates for secure auto TLS
initialization.

See RFC cockroachdb#51991. Part of cockroachdb#60632.

Release justification: New code path, gated behind an
experimental feature flag.
Release note: None.
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 26, 2021
Adds a create_join_token builtin function for use in TLS auto-joins.
This function, when run on a self-hosted single-tenant
Cockroach node, creates and returns a new join token. This
join token can then be copy-pasted to new nodes and used
to give them the set of certificates for secure auto TLS
initialization.

See RFC cockroachdb#51991. Part of cockroachdb#60632.

Release note: None.
craig bot pushed a commit that referenced this pull request Mar 29, 2021
62053: sql: Add system table join_tokens, and create_join_tokens() builtin function r=knz a=itsbilal

This change adds a new system table, `join_tokens`, for the
exclusive use of storing join tokens. This is necessary as
we need guaranteed at-most-once semantics with these, which
transactions give us pretty easily. A related migration is also added
to create said table.

This change also adds a new builtin function, `crdb_internal.create_join_token()`
that creates and persists a join token in that table.

Currently, there's no mechanism to remove expired join tokens.

See RFC #51991. Part of #60632.

Release note (general change): Add `crdb_internal.create_join_token()`
sql builtin function to create join tokens for use when joining
new nodes to a secure cluster. This functionality is hidden behind
a feature flag.

Co-authored-by: Bilal Akhtar <[email protected]>
@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@rafiss rafiss removed their request for review January 10, 2022 18:30
@cockroach-teamcity
Member

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Aaron Blum seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

Labels
do-not-merge bors won't merge a PR with this label. X-noremind Bots won't notify about PRs with X-noremind