Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: Implement auto-init handshake to negotiate CA certs between nodes #60766

Merged
merged 1 commit into from
Feb 24, 2021

Conversation

itsbilal
Copy link
Contributor

@itsbilal itsbilal commented Feb 19, 2021

This PR implements the init protocol phase of the cert-free setup
described in #51991. A lot of the code is pulled out of Aaron's
reference implementation of this protocol:
https://github.com/aaron-crl/toy-secure-init-handshake/tree/n-way-join

One part of #60632.

Ready for review, but these parts are still TODO:

Release note: None.

@itsbilal itsbilal requested a review from a team as a code owner February 19, 2021 02:36
@itsbilal itsbilal self-assigned this Feb 19, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, and @knz)


pkg/server/init_handshake.go, line 44 at r5 (raw file):

// internode trust delivery struct
type nodeTrustBundle struct {

I believe the RFC calls this the "initialization-bundle". Would it be useful to use the same name here somewhere or reference it in the comment?


pkg/server/init_handshake.go, line 143 at r5 (raw file):

		Certificates: []tls.Certificate{serverCert},
		RootCAs:      certpool,
	}

It appears that in pkg/security/tls.go we have a newBaseTLSConfig. Might it be worth making some portion of that public so that all TLS servers share the same basic configuration (specifically around cipher suites, minimum TLS version, etc)?


pkg/server/init_handshake.go, line 169 at r5 (raw file):

	if !challenge.validHMAC(t.token) {
		http.Error(res, "invalid message", http.StatusBadRequest)

This error is different from the previous error return. It might make debugging a bit harder, but would it be prudent to make sure all of the externally emitted errors from this process return something like http.StatusUnauthorized to prevent any potential information leakage that could be valuable to an attacker? Or at least all errors returned at and before this point in the function.


pkg/server/init_handshake.go, line 230 at r5 (raw file):

		}

		peerHostnameAndCa, err := t.getPeerCACert(peerHostname, selfAddress)

Do we need logging around errors here? Or rather, how would connectivity issues between the nodes surface to the user?


pkg/server/init_handshake.go, line 364 at r5 (raw file):
The RFC says:

Once this bundle is created, this node uses the public key for each peer's CA to create strong TLS connections to each peer using their trusted CA and deliver the initialization-bundle in addition to a MAC of the bundle generated with the initialization-token to prove authenticity.

Here, our MAC covers just the CACertificate and CAKey rather than the entire bundle. I haven't thought through whether that is sufficient. If it is sufficient, we should update the RFC. If it isn't, we should update this code.

It might be nice to have a function on nodeTrustBundle{} that computes the HMAC so we could use it both here and at validTrustBundle().


pkg/server/init_handshake.go, line 375 at r5 (raw file):

	}
	select {
	case <-handshaker.finishedInit:

What writes to finishedInit in the case of the trustLeader? I'm not seeing it, but perhaps I missed it.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from 314a17c to 2c3c8e8 Compare February 19, 2021 20:52
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, Steven! Addressed most comments, and rebased on the latest iteration of the base branch. I'll add the tests shortly.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 44 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I believe the RFC calls this the "initialization-bundle". Would it be useful to use the same name here somewhere or reference it in the comment?

Done. Referenced it in the comment.


pkg/server/init_handshake.go, line 169 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

This error is different from the previous error return. It might make debugging a bit harder, but would it be prudent to make sure all of the externally emitted errors from this process return something like http.StatusUnauthorized to prevent any potential information leakage that could be valuable to an attacker? Or at least all errors returned at and before this point in the function.

I wanted to treat the previous one as an internal error, and this one as a user error, but now that I think about it, both are malformed requests so I'll return a 400 bad request and the same message for both of them.


pkg/server/init_handshake.go, line 230 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Do we need logging around errors here? Or rather, how would connectivity issues between the nodes surface to the user?

Good point. Added a log message. Currently it'd just wait forever until the timeout, which is way more opaque than it should be.


pkg/server/init_handshake.go, line 364 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

The RFC says:

Once this bundle is created, this node uses the public key for each peer's CA to create strong TLS connections to each peer using their trusted CA and deliver the initialization-bundle in addition to a MAC of the bundle generated with the initialization-token to prove authenticity.

Here, our MAC covers just the CACertificate and CAKey rather than the entire bundle. I haven't thought through whether that is sufficient. If it is sufficient, we should update the RFC. If it isn't, we should update this code.

It might be nice to have a function on nodeTrustBundle{} that computes the HMAC so we could use it both here and at validTrustBundle().

Done. Also did the helper. Thanks for catching that discrepancy.


pkg/server/init_handshake.go, line 375 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

What writes to finishedInit in the case of the trustLeader? I'm not seeing it, but perhaps I missed it.

Done. Good catch, I meant to add this under an "else" above, but missed it.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from 2c3c8e8 to cf19108 Compare February 19, 2021 22:54
@itsbilal
Copy link
Contributor Author

Also added a unit test for the simple / correct case. Will expand it to include cases where the tokens don't match.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from cf19108 to 2ae7b43 Compare February 19, 2021 23:32
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. You'll want to rebase to pick up the latest changes from #60705.

Reviewed 2 of 4 files at r6, 4 of 4 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, and @stevendanna)


pkg/server/init_handshake.go, line 38 at r7 (raw file):

var errInvalidHMAC = errors.New("invalid HMAC signature")

// Blob used for symetric exchange of node public CA keys.

nit: symmetric


pkg/server/init_handshake.go, line 195 at r7 (raw file):

	var challenge nodeHostnameAndCA

	// TODO(aaron-crl): [Security] make this more error resilient to size and shape attacks

nit: period at end of sentence.


pkg/server/init_handshake.go, line 208 at r7 (raw file):

	}

	// send the valid node infomation to the trustedPeers channel

nit: here and below full sentences in comment.


pkg/server/init_handshake.go, line 257 at r7 (raw file):

func (t *tlsInitHandshaker) getPeerCACert(peerAddress string, selfAddress string) (nodeHostnameAndCA, error) {
	// TODO(aaron-crl): [Enhancement] add TLS protocol level checks to make sure remote certificate matches preferred one

nit: period at end of sentence.


pkg/server/init_handshake.go, line 273 at r7 (raw file):

	var body bytes.Buffer
	_ = json.NewEncoder(&body).Encode(challenge)
	res, err := client.Post("https://"+peerAddress+"/trustInit/", "application/json; charset=utf-8", &body)

Here and below: extract the URL paths in global constants. Also use the same constants in the server URL routing logic.

@knz
Copy link
Contributor

knz commented Feb 21, 2021

cc @aaron-crl

As I started working on the next bit of server code, I encountered a design mismatch in the interface of server.InitHandshake. Currently this conflates the number of expected peers with the size of the peer array.

This a problem: in practice, folk with orchestration can pass the list of peers via a wildcard address. Also we may not have symmetric join lists. Also we want eventually to support resolution via SRV records.

I think that semantically, the only thing that matters is that eventually there are N nodes connected, where N is the number of expected peers. If one of the outgoing connect() calls fails, it's not a problem - maybe there were more peer addresses specified than expected. If there are fewer peer addresses than expected peers, that is also OK - perhaps there will be peers connecting back to the node instead.

So I think it would be better to pass the number of expected peers separately from the array of peer addresses.

@knz
Copy link
Contributor

knz commented Feb 21, 2021

Regarding this last point, I added a commit in #60854 to do the interface change.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB: the new test TestInitHandshake fails when rebased on top of the latest #60705 however the error indicates why:

--- FAIL: TestInitHandshake (0.38s)
    test_log_scope.go:73: test logs captured to: /tmp/logTestInitHandshake534579817
    test_log_scope.go:74: use -show-logs to present logs inline
    init_handshake_test.go:68:
                Error Trace:    init_handshake_test.go:68
                Error:          Received unexpected error:
                                failed to create certificates: failed to parse valid Certificate from PEM blob: x509: failed to parse private key (use ParsePKCS8PrivateKey instead for this key format)
                                (1) attached stack trace
                                  -- stack trace:
                                  | github.com/cockroachdb/cockroach/pkg/server.InitHandshake
                                  |     /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/init_handshake.go:365
                                  | [...repeated from below...]
                                Wraps: (2) failed to create certificates
                                Wraps: (3) attached stack trace
                                  -- stack trace:
                                  | github.com/cockroachdb/cockroach/pkg/security.CreateServiceCertAndKey
                                  |     /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/security/auto_tls_init.go:155
                                  | github.com/cockroachdb/cockroach/pkg/server.createNodeInitTempCertificates
                                  |     /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/init_handshake.go:134
                                  | github.com/cockroachdb/cockroach/pkg/server.InitHandshake
                                  |     /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/init_handshake.go:363
                                  | github.com/cockroachdb/cockroach/pkg/server.TestInitHandshake.func1
                                  |     /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/init_handshake_test.go:55
                                  | runtime.goexit
                                  |     /usr/local/go/src/runtime/asm_amd64.s:1371
                                Wraps: (4) failed to parse valid Certificate from PEM blob
                                Wraps: (5) x509: failed to parse private key (use ParsePKCS8PrivateKey instead for this key format)
                                Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
                Test:           TestInitHandshake
    panic.go:613: -- test log scope end --

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, and @stevendanna)

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from 2ae7b43 to ee4f352 Compare February 22, 2021 16:15
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing that change in your PR! I'll leave the interface unchanged to make it easier for you. And that test seems to be working for me, even post-rebase. Trick is to borrow all of Aaron's changes in pkg/security and ditch any that I made in this branch, as they were just a temporary hack. Latest push to this PR has that cleaned up and rebased on master.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 143 at r5 (raw file):

Previously, stevendanna (Steven Danna) wrote…

It appears that in pkg/security/tls.go we have a newBaseTLSConfig. Might it be worth making some portion of that public so that all TLS servers share the same basic configuration (specifically around cipher suites, minimum TLS version, etc)?

I can leave this as a TODO for now. It'll require a bit of a refactor of things there.


pkg/server/init_handshake.go, line 273 at r7 (raw file):

Previously, knz (kena) wrote…

Here and below: extract the URL paths in global constants. Also use the same constants in the server URL routing logic.

Done.

Copy link

@aaron-crl aaron-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you might not have cleanly dropped all changes against pkg/security. Maybe you missed a commit?

Also, some of the code copied from my reference implementation has poor comment hygiene (missing capital letter at start and punctuation at the end). Let me know if you want me to toss you a PR fixing that after the rebase is cleaned up.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, and @stevendanna)


pkg/security/auto_tls_init.go, line 36 at r8 (raw file):

// createCertificateSerialNumber is a helper function that generates a
// random value between 1 and 2^31.

It looks like here and throughout the file this has reverted the pkg/security/auto_tls_init.go changes against master.


pkg/server/init_handshake.go, line 273 at r7 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done.

Am I misunderstanding or did this change not stick?

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from ee4f352 to 44f5da2 Compare February 22, 2021 17:23
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad - I didn't drop everything in pkg/security. Fixing.

All the comments should be already fixed though. Let me know if I missed some. Also added the mismatching-token case, so I'll drop the WIP moniker from this PR now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @knz, and @stevendanna)


pkg/security/auto_tls_init.go, line 36 at r8 (raw file):

Previously, aaron-crl wrote…

It looks like here and throughout the file this has reverted the pkg/security/auto_tls_init.go changes against master.

Done.


pkg/server/init_handshake.go, line 273 at r7 (raw file):

Previously, aaron-crl wrote…

Am I misunderstanding or did this change not stick?

Yep, my bad. Juggling too many changes here right now.

@itsbilal itsbilal changed the title [WIP] server: Implement auto-init handshake to negotiate CA certs between nodes server: Implement auto-init handshake to negotiate CA certs between nodes Feb 22, 2021
@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch 3 times, most recently from 4c22f32 to 4bb6f5d Compare February 22, 2021 21:31
@itsbilal
Copy link
Contributor Author

Latest push swaps out serverListenAddr and serverListenPort for a Listener. Seems like this was necessary to make tests work; TeamCity wasn't liking tests listening on a fixed host/port.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch 2 times, most recently from be20b06 to 30ecbe1 Compare February 22, 2021 22:11
@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch 6 times, most recently from eb3e7b9 to b653fd9 Compare February 23, 2021 17:17
@itsbilal
Copy link
Contributor Author

Sorry for the repeat force-pushes - I was able to isolate the CI failure to just be significantly slower cert generation on teamcity than elsewhere. I'll bump up timeouts to account for that.

I'll ping this PR again when it's ready to review and I've addressed all the open comments.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch 3 times, most recently from 5181916 to d29fede Compare February 23, 2021 19:04
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed almost all the comments - this is ready for another look. Thanks for the reviews! I'm also very close to a CI green - the big blockers are now fixed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 401 at r9 (raw file):

Previously, aaron-crl wrote…

nit: Comment should be a complete sentence.

Done. Thanks!


pkg/server/init_handshake.go, line 254 at r10 (raw file):

Previously, aaron-crl wrote…

I think you can use InitializeNodeFromBundle for this now.

Done.


pkg/server/init_handshake.go, line 467 at r10 (raw file):

Previously, aaron-crl wrote…

This is one of the places where a locked cluster may manifest where a peer has restarted or been replaced. If the TLS handshake fails it means that the original node is no longer at that address or there's a man-in-the-middle. We should definitely elevate this error.

So if I'm understanding this correctly, we shouldn't repeatedly retry sending the bundle but should fail fast if it's a TLS handshake error, right?


pkg/server/init_handshake.go, line 114 at r11 (raw file):

Previously, stevendanna (Steven Danna) wrote…

[nit] Just to avoid having to keep these lists in sync, perhaps something like:

// Computes the HMAC signature based on a signature of all CA
// certificate/keys in the Bundle.
func (n *nodeTrustBundle) computeHMAC(secretToken []byte) []byte {
	h := hmac.New(sha256.New, secretToken)
	h.Write(n.Bundle.InterNode.CACertificate)
	h.Write(n.Bundle.InterNode.CAKey)
	h.Write(n.Bundle.UserAuth.CACertificate)
	h.Write(n.Bundle.UserAuth.CAKey)
	h.Write(n.Bundle.SQLService.CACertificate)
	h.Write(n.Bundle.SQLService.CAKey)
	h.Write(n.Bundle.RPCService.CACertificate)
	h.Write(n.Bundle.RPCService.CAKey)
	h.Write(n.Bundle.AdminUIService.CACertificate)
	h.Write(n.Bundle.AdminUIService.CAKey)
	return h.Sum(nil)
}

// Sets the HMAC signature field in n based on a signature of all CA
// certificate/keys in the Bundle. Converse of validHMAC.
func (n *nodeTrustBundle) signHMAC(secretToken []byte) {
	n.HMAC = n.computeHMAC(secretToken)
}


// Verifies the HMAC signature field set in n to match that of the bundled
// CA certificates and fields. Converse of signHMAC.
func (n *nodeTrustBundle) validHMAC(secretToken []byte) bool {
        expectedMac := n.computeHMAC(secretToken)
	return hmac.Equal(expectedMac, n.HMAC)
}

Done. Thanks for writing this out!


pkg/server/init_handshake.go, line 117 at r11 (raw file):

Previously, stevendanna (Steven Danna) wrote…

[question] I know the named returns can be nice as a sort of documentation of the return signature, but since we are using explicit returns now, might it be good to avoid the named return arguments here to prevent any unexpected problems in the future? What is our usual code style here?

I've seen the named returns in other places as just a marker for what the return values are, but it's definitely less common if it's not being used. I've just ditched the names as it's pretty obvious here (from the function name). Thanks!


pkg/server/init_handshake.go, line 336 at r11 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Whose reading this channel?

It's read down in InitHandshake where we also select on trustedPeers.


pkg/server/init_handshake.go, line 374 at r11 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I imagine it would be rare, but any reason not to handle the error here?

Also, I wonder if it would be worth it here to construct a bytes.Reader from this buffer so that we could do this encoding once outside the loop and leek to the beginning on retry.

Done. Thanks for suggesting this!


pkg/server/init_handshake.go, line 410 at r11 (raw file):

Previously, stevendanna (Steven Danna) wrote…

As far as I can tell, listenPort and listenAddr/listenHost are only ever used to construct an address in the form of host:port. I wonder if it would make sense for this function to take a string listenerAddr directly or call listener.Addr().String() and use what it returns. I see that this change was made for testing, but I think either of these strategies could work for testing.

I mention this because I'm a bit concerned that constructing addresses with Sprintf("%s:%d", host, port) won't work well in IPv6 environments when the user has given us IP addresses rather than hostnames.

Done. I changed this to using listenAddr. The only place that just needs a host (no port) is createNodeInitialTempCertificates, so that part still exists, but everywhere else it's addr.String().


pkg/server/init_handshake_test.go, line 67 at r11 (raw file):

Previously, stevendanna (Steven Danna) wrote…

You could use require.NoError here as you did above. Similarly for the next few error handling cases below.

Done.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from d29fede to 053a97d Compare February 23, 2021 19:37
Copy link

@aaron-crl aaron-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had another pass through; left a few organizational thoughts and one security enhancement request. If you don't want to address that in this PR we can pick it up in another one.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 467 at r10 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

So if I'm understanding this correctly, we shouldn't repeatedly retry sending the bundle but should fail fast if it's a TLS handshake error, right?

I think so. If we're trying to provision a bundle to a known host with a known cert chain and we get a different cert, either the node we thought we had for quorum is dead and restarted with new certs or we're no longer talking to the original node. The former is unrecoverable and the latter is outside the provisioners ability to define.


pkg/server/init_handshake.go, line 114 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done. Thanks for writing this out!

Very nice suggestion @stevendanna !


pkg/server/init_handshake.go, line 36 at r12 (raw file):

const (
	initServiceName     = "temp-init-service"
	defaultInitLifespan = 5 * time.Minute

Do we enable setting this anywhere? Five minutes might be a bit tight outside of trivial deployments; I'd default to at least 10 minutes here regardless.


pkg/server/init_handshake.go, line 205 at r12 (raw file):

	var challenge nodeHostnameAndCA

	// TODO(aaron-crl): [Security] Make this more error resilient to size and shape attacks.

I know this was copied from the reference implementation but it should probably be addressed here.

The issue is that an attacker could create an extremely large JSON blob and pass it to the server to eat CPU and memory. Since we know the tokens shape and size we should restrict both on read with something like this:

req.Body = http.MaxBytesReader(w, req.Body, 4096)

dec := json.NewDecoder(req.Body)
dec.DisallowUnknownFields()
err := dec.Decode(&challenge)
if err != nil {
    // Invalid message received, return BadRequestError.
    ...
}

h/t: to https://www.alexedwards.net/blog/how-to-properly-parse-a-json-request-body


pkg/server/init_handshake.go, line 368 at r12 (raw file):

	rootCAs.AppendCertsFromPEM(peerCACert)

	client := t.getClient(false /* insecureSkipVerify */, rootCAs /* rootCAs */)

This feels dangerous and error prone. Might be better to move the insecure config call to an explicitly named insecure function or instantiate the insecure config at it's only call site above and use the pkg/security config elsewhere.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from 053a97d to 12862b5 Compare February 23, 2021 20:53
Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 36 at r12 (raw file):

Previously, aaron-crl wrote…

Do we enable setting this anywhere? Five minutes might be a bit tight outside of trivial deployments; I'd default to at least 10 minutes here regardless.

Done. It's 10 mins now. It's currently not configurable at all; open to thoughts on how we can make it so.


pkg/server/init_handshake.go, line 205 at r12 (raw file):

Previously, aaron-crl wrote…

I know this was copied from the reference implementation but it should probably be addressed here.

The issue is that an attacker could create an extremely large JSON blob and pass it to the server to eat CPU and memory. Since we know the tokens shape and size we should restrict both on read with something like this:

req.Body = http.MaxBytesReader(w, req.Body, 4096)

dec := json.NewDecoder(req.Body)
dec.DisallowUnknownFields()
err := dec.Decode(&challenge)
if err != nil {
    // Invalid message received, return BadRequestError.
    ...
}

h/t: to https://www.alexedwards.net/blog/how-to-properly-parse-a-json-request-body

I'll leave this for a future PR if that's fine? This PR has gotten convoluted enough already.

Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like Reviewable ate up some of my comments. Adding the rest

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 467 at r10 (raw file):

Previously, aaron-crl wrote…

I think so. If we're trying to provision a bundle to a known host with a known cert chain and we get a different cert, either the node we thought we had for quorum is dead and restarted with new certs or we're no longer talking to the original node. The former is unrecoverable and the latter is outside the provisioners ability to define.

That makes sense. Do you happen to know what the TLS error would be (handshake error, connection reset, unrecognized root CA) , so I can search for it and fail fast?

Currently, the code would hang until the context is cancelled due to timeout, but that's less ideal and less secure.


pkg/server/init_handshake.go, line 368 at r12 (raw file):

Previously, aaron-crl wrote…

This feels dangerous and error prone. Might be better to move the insecure config call to an explicitly named insecure function or instantiate the insecure config at it's only call site above and use the pkg/security config elsewhere.

Done.

Copy link

@aaron-crl aaron-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo those two TODO's and TLS error check this :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 467 at r10 (raw file):

Do you happen to know what the TLS error would be
Not off the top of my head though unrecognized root CA would probably be the best indication of a replaced node since it would still have the same hostname but not the same CA.

You could test for that and leave a TODO to pick up after with our testing (when we generate a test for this case).


pkg/server/init_handshake.go, line 36 at r12 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done. It's 10 mins now. It's currently not configurable at all; open to thoughts on how we can make it so.

Probably worth it's own CLI flag. I can see this being tuned per environment. Let's drop a TODO to pull it from the Config and pick it up from there.


pkg/server/init_handshake.go, line 205 at r12 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I'll leave this for a future PR if that's fine? This PR has gotten convoluted enough already.

I can pick it up in my next PR.

Copy link
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs! Added the two TODOs

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, @knz, and @stevendanna)


pkg/server/init_handshake.go, line 467 at r10 (raw file):

Previously, aaron-crl wrote…

Do you happen to know what the TLS error would be
Not off the top of my head though unrecognized root CA would probably be the best indication of a replaced node since it would still have the same hostname but not the same CA.

You could test for that and leave a TODO to pick up after with our testing (when we generate a test for this case).

Done.


pkg/server/init_handshake.go, line 36 at r12 (raw file):

Previously, aaron-crl wrote…

Probably worth it's own CLI flag. I can see this being tuned per environment. Let's drop a TODO to pull it from the Config and pick it up from there.

Done.

@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch 2 times, most recently from c650873 to 9a9eea8 Compare February 23, 2021 21:49
…odes

This PR implements the init protocol phase of the cert-free setup
described in cockroachdb#51991. A lot of the code is pulled out of Aaron's
reference implementation of this protocol:
https://github.com/aaron-crl/toy-secure-init-handshake/tree/n-way-join

One part of cockroachdb#60632.

Release note: None.
@itsbilal itsbilal force-pushed the add-tls-auto-init-handshake branch from 9a9eea8 to 6f92fe9 Compare February 23, 2021 22:19
@itsbilal
Copy link
Contributor Author

TFTRs!

bors r=aaron-crl

@craig
Copy link
Contributor

craig bot commented Feb 24, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 24, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 24, 2021

Build succeeded:

@craig craig bot merged commit ec01162 into cockroachdb:master Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants