-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Client/Server TLS dynamic reload #3492
Conversation
070e632
to
e27c2db
Compare
e25aadb
to
f38e864
Compare
be7abb1
to
ed07520
Compare
392b066
to
576b595
Compare
576b595
to
b7d6488
Compare
command/agent/agent.go
Outdated
a.logger.Printf("[WARN] agent: Issue reloading the server's TLS Configuration, consider a full system restart: %v", err.Error()) | ||
return err, false | ||
} | ||
} else if c := a.Client(); c != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be running as both so don't use else if
use another if
b7d6488
to
ad970c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made one pass through
client/client.go
Outdated
@@ -363,6 +364,25 @@ func (c *Client) init() error { | |||
return nil | |||
} | |||
|
|||
// ReloadTLSConnectoins allows a client to reload RPC connections if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
client/client.go
Outdated
|
||
return nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the only place that can throw any errors here the OutgoingTLSWrapper method?
client/client_test.go
Outdated
QueryOptions: structs.QueryOptions{Region: "dc1"}, | ||
} | ||
var out structs.SingleNodeResponse | ||
testutil.AssertUntil(100*time.Millisecond, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this trying to establish, looks like it always expects Node.GetNode to fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes several checks until either the request succeeds or it hits the predetermined timeout.
if err != nil { | ||
return err | ||
} | ||
c.httpServer = http |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assignment seems unsafe without a mutex around it, what if two different calls to this reload method happen close enough in time in quick succession?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, fixing.
client/client.go
Outdated
c.config.TLSConfig = newConfig | ||
c.configLock.Unlock() | ||
|
||
if c.config.TLSConfig.EnableRPC { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this only checking for enableRPC and not for enableHttp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTTP connections are handled at the HTTP server level
command/agent/command.go
Outdated
c.agent.logger.Printf("[ERR] agent: failed to reload http server: %v", err) | ||
return | ||
} | ||
c.agent.logger.Println("[INFO] agent: successfully restarted the HTTP server") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic looks identical to the reloadHTTPServerOnConfigChange
method, and that method is unused, did you forget to make the replacement maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I did, fixed that up.
|
||
oldPool := p.pool | ||
for _, conn := range oldPool { | ||
conn.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is ignoring the error from conn.Close(), but so is the rest of this file. Need some context from @dadgar or @schmichael as to why close errors are ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know there's nothing useful to do with Close errors. In some circumstances logging them might be useful, but I can't imagine what actionable information errors from this Close could contain? I don't think it's even useful for catching programming errors like closing closed connections as there's always the possibility the connection may close concurrently due to peer disconnects or other issues.
If somebody has an example of a useful Close error I'd love to know it!
close(n.shutdownCh) | ||
n.shutdown = true | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create an issue in the raft library and open a PR for this there and vendor the changes in. Otherwise, this will drift when we update raft.
3c50f91
to
1288b53
Compare
fix up linting
c6c6318
to
11089b2
Compare
@@ -472,6 +472,17 @@ NOTE: Dynamically reloading certificates will _not_ close existing connections. | |||
If you need to rotate certificates due to a security incident, you will still | |||
need to completely shutdown and restart the Nomad agent. | |||
|
|||
## Migrating a cluster to TLS | |||
|
|||
Nomad supports dynamically reloading a server to move from plaintext to TLS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nomad supports dynamically reloading it's TLS configuration. To reload Nomad's configuration, first update the configuration file and then send the Nomad agent a SIGHUP signal. Note that this will only reload a subset of the configuration file, including the TLS configuration.
When reloading the configuration, if there is a change to the TLS configuration, the agent will reload all network connections and when establishing new connections, will use the new configuration. This process
works for both upgrading and downgrading TLS (but we recommend upgrading).
client/client.go
Outdated
var tlsWrap tlsutil.RegionWrapper | ||
if newConfig != nil && newConfig.EnableRPC { | ||
tw, err := c.config.NewTLSConfiguration(newConfig).OutgoingTLSWrapper() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove blank line here
client/client_test.go
Outdated
assert := assert.New(t) | ||
|
||
s1, addr := testServer(t, func(c *nomad.Config) { | ||
c.Region = "foo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the wrong region: https://github.com/hashicorp/nomad/pull/3721/files#diff-769638ff36c019b3e4c9eab992c23f60R190
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further I don't think it is necessary to set it for this test
CertFile: foocert, | ||
KeyFile: fookey, | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you want to assert the node registered using plaintext?
client/client_test.go
Outdated
defer c1.Shutdown() | ||
|
||
newConfig := &nconfig.TLSConfig{} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert it isn't registered
client/config/config.go
Outdated
// NewTLSConfiguration returns a TLSUtil Config for a new TLS config object | ||
// This allows a TLSConfig object to be created without first explicitly | ||
// setting it | ||
func (c *Config) NewTLSConfiguration(tlsConfig *config.TLSConfig) *tlsutil.Config { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this and put it back to how it was. Further if you wanted to have these separate, this one shouldn't be a method on the client Config. It should be func NewTLSConfiguration
command/agent/command.go
Outdated
err := c.agent.Reload(newConf) | ||
if err != nil { | ||
c.agent.logger.Printf("[ERR] agent: failed to reload the config: %v", err) | ||
shouldReload := c.agent.ShouldReload(newConf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this maybe return a separate boolean for whether the http server should be reloaded? As we add stuff, you could potential reload the agent w/o needing to reload the HTTP server
|
||
func TestServer_Reload_TLS_WithNilConfiguration(t *testing.T) { | ||
t.Parallel() | ||
assert := assert.New(t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving forward can you use https://godoc.org/github.com/stretchr/testify/require. It will hard fail on the first failed assertion
command/agent/http_test.go
Outdated
} | ||
|
||
s := makeHTTPServer(t, func(c *Config) { | ||
c.TLSConfig = newConfig.TLSConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you have to set the region to regionFoo? I don't understand the test setup or how it is passing. The region is wrong and the server starts with TLS and it is being reloaded with the same config as it is originally started as
nomad/config.go
Outdated
// newTLSConfig returns a TLSUtil Config based on the server configuration | ||
// This is useful for creating a TLSConfig object without explictely setting it | ||
// on the config. | ||
func (c *Config) newTLSConfig(newConf *config.TLSConfig) *tlsutil.Config { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bump, lets put this back to how it was
nomad/server.go
Outdated
s.listenerCh = make(chan struct{}) | ||
list, err := net.ListenTCP("tcp", s.config.RPCAddr) | ||
if err != nil || list == nil { | ||
s.logger.Printf("[ERR] nomad: No TLS listener to reload") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the list == nil a valid case? Why not just check and log the error?
nomad/server.go
Outdated
|
||
if !newConfig.TLSConfig.Equals(s.config.TLSConfig) { | ||
if err := s.reloadTLSConnections(newConfig.TLSConfig); err != nil { | ||
s.logger.Printf("[DEBUG] nomad: reloading server TLS configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you wanted to log the error?
nomad/server.go
Outdated
@@ -776,6 +884,7 @@ func (s *Server) setupRPC(tlsWrap tlsutil.RegionWrapper) error { | |||
return err | |||
} | |||
s.rpcListener = list | |||
s.listenerCh = make(chan struct{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use createRPCListener? There is duplicated code
nomad/server.go
Outdated
tlsConf := s.config.newTLSConfig(newTLSConfig) | ||
incomingTLS, tlsWrap, err := getTLSConf(newTLSConfig.EnableRPC, tlsConf) | ||
if err != nil { | ||
s.logger.Printf("[ERR] nomad: unable to reset TLS context") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log the errors. Can you search through the PR and make sure all your logs have the errors logged.
"unable to reset TLS context" is slightly vague. Can you make it more specific to the fact that it is creating the TLS config that failed.
nomad/server.go
Outdated
s.configLock.Unlock() | ||
|
||
if s.rpcCancel == nil { | ||
s.logger.Printf("[ERR] nomad: No TLS Context to reset") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't about the context, it is there is no RPC servers. Structure so that the log reuses the returned error
nomad/server.go
Outdated
// CLose existing streams | ||
wrapper := tlsutil.RegionSpecificWrapper(s.config.Region, tlsWrap) | ||
s.raftLayer = NewRaftLayer(s.rpcAdvertise, wrapper) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.raftTransport.Reload(s.raftLayer)
?
Calls into questions the tests, if they are passing
nomad/server.go
Outdated
s.startRPCListener() | ||
|
||
time.Sleep(3 * time.Second) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not attached to the call site that returns the error
s.connPool.ReloadTLS(tlsWrap) | ||
|
||
// reinitialize our rpc listener | ||
s.rpcListener.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do all the RPC stuff and then (with a separate comment) start the Raft stuff
return trans | ||
} | ||
|
||
// Pause closes the current stream for a NetworkTransport instance | ||
func (n *NetworkTransport) Pause() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer these to be called CloseStream
and UseStream
and document that calling CloseStream
must be done before giving a new stream layer
// Accept incoming connections | ||
conn, err := n.stream.Accept() | ||
if err != nil { | ||
if n.IsShutdown() { | ||
return | ||
} | ||
// TODO Getting an error here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably need to check the ctx again as Accept is blocking
098f300
to
1dab7b5
Compare
close second goroutine in raft-net
1dab7b5
to
0805c41
Compare
a08ab45
to
231e9dd
Compare
nomad/server.go
Outdated
|
||
time.Sleep(3 * time.Second) | ||
time.Sleep(500 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
nomad/server_test.go
Outdated
assert.NotNil(out) | ||
assert.Equal(out.CreateIndex, resp.JobModifyIndex) | ||
for _, serv := range servers { | ||
testutil.WaitForResult(func() (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait for leader on positive cases
vendor/vendor.json
Outdated
{"path":"gopkg.in/tomb.v1","checksumSHA1":"TO8baX+t1Qs7EmOYth80MkbKzFo=","revision":"dd632973f1e7218eb1089048e0798ec9ae7dceb8","revisionTime":"2014-10-24T13:56:13Z"}, | ||
{"path":"gopkg.in/tomb.v2","checksumSHA1":"WiyCOMvfzRdymImAJ3ME6aoYUdM=","revision":"14b3d72120e8d10ea6e6b7f87f7175734b1faab8","revisionTime":"2014-06-26T14:46:23Z"}, | ||
{"path":"gopkg.in/yaml.v2","checksumSHA1":"12GqsW8PiRPnezDDy0v4brZrndM=","revision":"a5b47d31c556af34a302ce5d659e6fea44d90de0","revisionTime":"2016-09-28T15:37:09Z"} | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make vendorfmt
231e9dd
to
5170301
Compare
vendor/vendor.json
Outdated
@@ -162,7 +162,7 @@ | |||
{"path":"github.com/hashicorp/logutils","revision":"0dc08b1671f34c4250ce212759ebd880f743d883"}, | |||
{"path":"github.com/hashicorp/memberlist","checksumSHA1":"1zk7IeGClUqBo+Phsx89p7fQ/rQ=","revision":"23ad4b7d7b38496cd64c241dfd4c60b7794c254a","revisionTime":"2017-02-08T21:15:06Z"}, | |||
{"path":"github.com/hashicorp/net-rpc-msgpackrpc","revision":"a14192a58a694c123d8fe5481d4a4727d6ae82f3"}, | |||
{"path":"github.com/hashicorp/raft","checksumSHA1":"ecpaHOImbL/NaivWrUDUUe5461E=","revision":"3a6f3bdfe4fc69e300c6d122b1a92051af6f0b95","revisionTime":"2017-08-07T22:22:24Z"}, | |||
{"path":"github.com/hashicorp/raft","checksumSHA1":"zkA9uvbj1BdlveyqXpVTh1N6ers=","revision":"077966dbc90f342107eb723ec52fdb0463ec789b","revisionTime":"2018-01-17T20:29:25Z","version":"=master","versionExact":"master"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the version be just "master" not "=master"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point- this was from what I thought the govendor syntax needed to be. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add to changelog please
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This PR includes: