-
Notifications
You must be signed in to change notification settings - Fork 616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[agent/node] Bail on cert error when joining #2206
[agent/node] Bail on cert error when joining #2206
Conversation
8884a3c
to
96da955
Compare
96da955
to
1eeea15
Compare
Codecov Report
@@ Coverage Diff @@
## master #2206 +/- ##
==========================================
+ Coverage 60.03% 60.17% +0.13%
==========================================
Files 124 124
Lines 20115 20143 +28
==========================================
+ Hits 12077 12121 +44
+ Misses 6680 6661 -19
- Partials 1358 1361 +3 |
1eeea15
to
5e42e09
Compare
I also want to test that if |
I believe we pass a if n.config.JoinAddr != "" {
n.remotes.Observe(api.Peer{Addr: n.config.JoinAddr}, remotes.DefaultObservationWeight)
} I had the idea of making an alternate implementation of So scratch that. I guess the approach taken here makes sense. I'm not crazy about the extra complexity but I can't think of anything better. I'd just suggest removing |
@aaronlehmann Agree - if we could fail on any kind of error on first join, doing the check at the spot you point out I think would be a good idea. I'm also not super happy with how this works but I couldn't think of anything else for now. Have removed |
5e42e09
to
930f38b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with all the above comments. Added complexity isn't good, but this is needed.
agent/config.go
Outdated
@@ -43,6 +43,10 @@ type Config struct { | |||
|
|||
// NodeTLSInfo contains the starting node TLS info to bootstrap into the agent | |||
NodeTLSInfo *api.NodeTLSInfo | |||
|
|||
// SessionTracker, if provided, will have its OnSessionClosed and OnSessionError methods called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still has On
in the names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks!
node/node.go
Outdated
// because of a TLS error, we don't want to exit the agent because a previously successful | ||
// session indicates that the TLS error may be a transient issue. | ||
|
||
// Note: locks are not needed on the assumption that these will only be called within the agent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well add locks and remove the need for this assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
node/node.go
Outdated
func (fs *firstSessionErrorTracker) SessionClosed() error { | ||
// unfortunately grpc connection errors are type grpc.rpcError, which are not exposed, and we can't get at the underlying error type | ||
if !fs.pastFirstSession && grpc.Code(fs.err) == codes.Internal && | ||
strings.HasPrefix(grpc.ErrorDesc(fs.err), "connection error") && strings.Contains(grpc.ErrorDesc(fs.err), "transport: x509:") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is terrible. grpc--
…ing, and registrations to be tracked. Signed-off-by: Ying Li <[email protected]>
930f38b
to
ee0a932
Compare
agent/agent.go
Outdated
@@ -206,6 +206,11 @@ func (a *Agent) run(ctx context.Context) { | |||
leaving = a.leaving | |||
subscriptions = map[string]context.CancelFunc{} | |||
) | |||
defer func() { | |||
if session != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's possible for session
to be nil. In the <-ctx.Done()
branch, session.close()
is called without this check.
That call should probably be removed now that closing the session is handled in a defer
. It's not harmful, but it's redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ee0a932
to
f03ac28
Compare
agent/agent.go
Outdated
@@ -206,6 +206,7 @@ func (a *Agent) run(ctx context.Context) { | |||
leaving = a.leaving | |||
subscriptions = map[string]context.CancelFunc{} | |||
) | |||
defer session.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Careful here. I believe this would call the close
method on the original session
.
Each time a "defer" statement executes, the function value and parameters to the call are evaluated as usual and saved anew but the actual function is not invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, good point - so wrapping it in a function is still the best idea since that evaluates when the function executes.
… established. If a session has never been established, and we get an x509 error, the make sure the agent does not rebuild new sessions and just exists. Make sure that this error gets propagated back from the node. Signed-off-by: Ying Li <[email protected]>
f03ac28
to
36eb01a
Compare
Also, should this be cherry-picked into 17.06? |
LGTM
I don't think this problem would come up very often or have a particularly severe effect, but please let me know if I've overlooked something. |
@aaronlehmann It probably won't, you're right. |
This is an attempt to address moby/moby#33380.
Basically, when a join address is passed, then if there are x509 errors when connecting to that join address, the agent should bail rather than attempt to rebuild a session over and over. This is on the assumption that a join address is only passed when attempting to join a cluster (see moby/moby#33361).
We want to make sure that, if the join address is passed, it only fails on x509 errors if there were no previous successful sessions. If there were previous successful sessions, this indicates that the x509 error is perhaps a result of one of the managers, as opposed to the node that is attempting to join.
Because of this, we have to keep track of how many sessions and session errors there've been on the agent, hence keeping track of state.
This is not the ideal implementation, because it is keeping track of agent lifetime state simply for the purpose of bailing early, and also because connection errors in GRPC aren't really checkable by type (since GRPC wraps and flattens all the errors down to just a description string).
An alternative solution is to load the certs and attempt to dial to the
JoinAddr
first, before even starting up the node, and see if it fails (see similar logic where we validate external CA urls: https://github.com/docker/swarmkit/blob/master/manager/controlapi/ca_rotation.go#L116).However, this adds a certain amount of lag on every
Node
instantiation, and also is not necessarily reliable because if there are non-x509 connection errors, we'd have to either just proceed onto starting the node and just continue retrying on non-x509 errors, or we have to wait for a while until either we've connected or failed with some kind of error.There was some discussion about changing the behavior of
Join
so that it does not try to join in the background, but will just fail after a given timeout, but as @aaronlehmann pointed out, this changes existing behavior.