Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] allow server cert rotation without a restart #1672

Closed
wants to merge 4 commits into from

Conversation

krasi-georgiev
Copy link
Contributor

@krasi-georgiev krasi-georgiev commented Oct 21, 2019

I did a quick test and reading the file from disk on every request is ok sinse the kernel puts this in the pagecache so in practice it reads the file from memory when not changed and reads it from disk when it is replaced with a new one.

related modules that use the same idea.
https://godoc.org/golang.org/x/crypto/acme/autocert
https://github.com/johanbrandhorst/certify

I did a quick test and reading the same file from disk is not a problem as the kernel caches it in the memory when the file is not changed.

Test for reading the same file again doesn't hit the disk
func Test_io(t *testing.T) {
	pid := os.Getpid()
	for {
		_, err := ioutil.ReadFile("env")
		testutil.Ok(t, err)

		io, err := ioutil.ReadFile("/proc/" + strconv.Itoa(pid) + "/io")
		testutil.Ok(t, err)

		fmt.Println(string(io))
		time.Sleep(500 * time.Millisecond)
	}
}

Copy link
Member

@povilasv povilasv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

@brancz
Copy link
Member

brancz commented Oct 22, 2019

This reads and parses the certificates on every request. I feel it would be appropriate to do this periodically instead and atomically swap them.

@krasi-georgiev
Copy link
Contributor Author

krasi-georgiev commented Oct 22, 2019

This reads and parses the certificates on every request. I feel it would be appropriate to do this periodically instead and atomically swap them.

Yep, and it was the second item on my to do list in the PR description 😸

refactor to reduce the performance overhead

I already refactored it to compare mod times and reload only when different, but will think how to improve even more.
I think this is only important with the receiver which receives many requests per second all other components should be fine even with this approach right?
I am thinking to add some delay arg which would be used only by the receiver.

by the way currently the Prometheus scrape client does this with every request which I agree should be improved.
https://github.com/prometheus/common/blob/2afc453a9de43c46dadd462998d096e52cd4e4e3/config/http_config.go#L348

@brancz
Copy link
Member

brancz commented Oct 23, 2019

Sorry was too hasty. All good :)

pkg/tls/tls.go Outdated
return nil, errors.Wrap(err, "building client CA certificate")
}

// ????? test if true...... This is a pointer it will update the underlying tls config pool.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is still a work in progress because of these comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep one of the items in my todo list

test if the CA pool gets updated with the ponter

@bwplotka
Copy link
Member

bwplotka commented Oct 29, 2019

Do we have a story for compatible rotation? e.g how this will work between server and client? (: e.g does the rotation has to happen in the same time for both srv and client?

Copy link
Contributor Author

@krasi-georgiev krasi-georgiev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a story for compatible rotation? e.g how this will work between server and client? (: e.g does the rotation has to happen in the same time for both srv and client?

good point. I think Thanos should only be responsible for rotating the certs when new ones are presented the rest should be handled by the config management.

All that said at the moment the setup allows a single CA and maybe it should be changed to allow multiple CA files to allow an easier rotation.

@brancz
Copy link
Member

brancz commented Oct 29, 2019

Even if a single file, CAs can usually be loaded as bundles so in reality there are multiple CAs in one file.

@krasi-georgiev
Copy link
Contributor Author

krasi-georgiev commented Nov 1, 2019

now after writing the tests I realized that there is another problem.
getConfigForClient and getCertificate are called only with the tls handshake and NOT when sending or receiving data which means that for long running data flow it will never reload the certs as it will reuse the same connection and no new tls handshake will happen.
In this PR I have also added a cli flag so users can set the keep alive setting with MaxConnectionAge which enforces a new TLS hanshake when MaxConnectionAge expires.
New tls handshake is initiated when the connection is interrupted or when the MaxConnectionAge expires.

So with this new info I think there are 3 options:

  1. keep the current implementation with a cli flag for MaxConnectionAge
  2. run some loop which will restart the server gracefully or drop all connections when it detects tls cert files changes.
  3. provide a reload endpoint and leave reloading to external config management.

Again I think this is important mainly for the thanos receiver as I imagine that with all other components a new connection and a new tls handshake would be created with every request so it will reload the certs as expected.

cc @bwplotka, @s-urbaniak

@krasi-georgiev
Copy link
Contributor Author

krasi-georgiev commented Nov 7, 2019

@squat , @bwplotka @s-urbaniak waiting for some feedback on which options do you think is best for this implementation.

@s-urbaniak
Copy link
Contributor

@krasi-georgiev Sorry for the late reply. I am voting for option "1. keep the current implementation with a cli flag for MaxConnectionAge"

I believe this is a sensible setting and less disruptive than the other options. Also, it does not require external coordination (someone has to implement that timeout).

Finally cert rotation happens in timespans of minutes, maybe even hours so a MaxConnectionAge in that timespan sounds sensible to me too.

@bwplotka @brancz Do you have additional thoughts or objections?

@bwplotka
Copy link
Member

the CA rotation works only for the client CAs and not for the server CAs. Missing GetServerConfig as mentioned in golang/go#16066 (comment)

How did we work around this one?

So I think MaxConnectionAge here might be too short by default (1m), as kind of defies any KeepAlive logic.. but maybe something longer would make sense. Also something I would like to check if MaxConnectionAge is actually accounted for during long-running requests, although I think we don't do those longer than X minutes.

What's the requirement for those cert rotations in terms of delay? I think ~minutes is ok, so maybe we could start with this, but larger MaxConnectionAge

@s-urbaniak
Copy link
Contributor

What's the requirement for those cert rotations in terms of delay? I think ~minutes is ok, so maybe we could start with this, but larger MaxConnectionAge

agreed, the ~minutes ballpark sounds good to me.

Signed-off-by: Krasi Georgiev <[email protected]>
Signed-off-by: Krasi Georgiev <[email protected]>
@krasi-georgiev
Copy link
Contributor Author

After a bit of a struggle the tls rotation is implemented with full e2e tests.

As mentioned in the first PR comment server CA rotation is missing becasue it is yet not added in the golang std library so I left a TODO note to add it when it is added in the golang std library. I don't think that this is a blocker for this PR and I will follow up the tracking golang issues for updates.

@squat I also noticed that there are no e2e tests for the reciever TLS, should I add it to this PR or you prefer to open another PR after this one is merged?

Signed-off-by: Krasi Georgiev <[email protected]>
Copy link
Member

@squat squat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good so far!

pkg/tls/tls.go Outdated Show resolved Hide resolved
pkg/tls/tls.go Show resolved Hide resolved
test/e2e/tls_test.go Show resolved Hide resolved
test/e2e/tls_test.go Outdated Show resolved Hide resolved
pkg/tls/tls.go Outdated Show resolved Hide resolved
pkg/tls/tls.go Outdated Show resolved Hide resolved
Signed-off-by: Krasi Georgiev <[email protected]>
@krasi-georgiev
Copy link
Contributor Author

@squat thanks for the review, addressed all comments.

didn't answer my original question though 😺

@squat I also noticed that there are no e2e tests for the reciever TLS, should I add it to this PR or you prefer to open another PR after this one is merged?

@squat
Copy link
Member

squat commented Dec 9, 2019

let's do it in a new PR to keep the concerns separate :) I can make the PR

@krasi-georgiev
Copy link
Contributor Author

@brancz @bwplotka any other comments for this PR?

@krasi-georgiev
Copy link
Contributor Author

ping @bwplotka

@GiedriusS
Copy link
Member

@krasi-georgiev are there any numbers on the performance impact of the current implementation? 👀 does Windows exhibit the same behavior with caching?

@krasi-georgiev
Copy link
Contributor Author

@krasi-georgiev are there any numbers on the performance impact of the current implementation?

probably, but only for single syscall to get the ModTime() so I would say negligible. If anyone wants to disable this can set a high enough --grpc-server-max-connection-age so that the connection is persistent and it never checks for new certs.

does Windows exhibit the same behavior with caching?

I would say yes, but is windows even suported by Thanos?
Last time I tried to run the tests under windoes these failed in many places.

@krasi-georgiev krasi-georgiev changed the title allow server cert rotation without a restart [WIP] allow server cert rotation without a restart Feb 12, 2020
@xunleii
Copy link

xunleii commented Mar 5, 2020

Thanks for this PR, it is awesome to automatically reload certificates when they are modified.

It works perfectly on Kubernetes with certificates automatically renewed by Cert-Manager 👍
I have tested this PR with a Thanos querier and a Thanos sidecar, using DNS SRV discovery, GRPC TLS and client auth by certiifcates, with all certificates renewed each 5 minutes.

EDIT: Sorry guys, I edit this comment because of huge pebcak on my side ... I have tested this PR and it seems working perfectly !

@stale
Copy link

stale bot commented Apr 4, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Apr 4, 2020
@bwplotka
Copy link
Member

bwplotka commented Apr 5, 2020 via email

@stale stale bot removed the stale label Apr 5, 2020
@stale
Copy link

stale bot commented Jun 4, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Jun 4, 2020
@bwplotka
Copy link
Member

bwplotka commented Jun 4, 2020 via email

@stale stale bot removed the stale label Jun 4, 2020
@stale
Copy link

stale bot commented Aug 3, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Aug 3, 2020
@bwplotka
Copy link
Member

bwplotka commented Aug 3, 2020 via email

@stale stale bot removed the stale label Aug 3, 2020
@stale
Copy link

stale bot commented Oct 2, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Oct 2, 2020
@bwplotka
Copy link
Member

bwplotka commented Oct 2, 2020 via email

@stale stale bot removed the stale label Oct 2, 2020
@krasi-georgiev
Copy link
Contributor Author

stale, someone else needs to take over.

@Tachone
Copy link

Tachone commented Jun 10, 2022

now after writing the tests I realized that there is another problem. getConfigForClient and getCertificate are called only with the tls handshake and NOT when sending or receiving data which means that for long running data flow it will never reload the certs as it will reuse the same connection and no new tls handshake will happen. In this PR I have also added a cli flag so users can set the keep alive setting with MaxConnectionAge which enforces a new TLS hanshake when MaxConnectionAge expires. New tls handshake is initiated when the connection is interrupted or when the MaxConnectionAge expires.

So with this new info I think there are 3 options:

  1. keep the current implementation with a cli flag for MaxConnectionAge
  2. run some loop which will restart the server gracefully or drop all connections when it detects tls cert files changes.
  3. provide a reload endpoint and leave reloading to external config management.

Again I think this is important mainly for the thanos receiver as I imagine that with all other components a new connection and a new tls handshake would be created with every request so it will reload the certs as expected.

cc @bwplotka, @s-urbaniak

Why do we need to set MaxConnectionAge, Is there any problem with the long connection not being closed? the old certificate is no longer used on the established connection, am i wrong?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants