Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[agent/node] Bail on cert error when joining #2206

Merged
merged 2 commits into from
Jun 5, 2017

Conversation

cyli
Copy link
Contributor

@cyli cyli commented May 30, 2017

This is an attempt to address moby/moby#33380.

Basically, when a join address is passed, then if there are x509 errors when connecting to that join address, the agent should bail rather than attempt to rebuild a session over and over. This is on the assumption that a join address is only passed when attempting to join a cluster (see moby/moby#33361).

We want to make sure that, if the join address is passed, it only fails on x509 errors if there were no previous successful sessions. If there were previous successful sessions, this indicates that the x509 error is perhaps a result of one of the managers, as opposed to the node that is attempting to join.

Because of this, we have to keep track of how many sessions and session errors there've been on the agent, hence keeping track of state.

This is not the ideal implementation, because it is keeping track of agent lifetime state simply for the purpose of bailing early, and also because connection errors in GRPC aren't really checkable by type (since GRPC wraps and flattens all the errors down to just a description string).

An alternative solution is to load the certs and attempt to dial to the JoinAddr first, before even starting up the node, and see if it fails (see similar logic where we validate external CA urls: https://github.com/docker/swarmkit/blob/master/manager/controlapi/ca_rotation.go#L116).

However, this adds a certain amount of lag on every Node instantiation, and also is not necessarily reliable because if there are non-x509 connection errors, we'd have to either just proceed onto starting the node and just continue retrying on non-x509 errors, or we have to wait for a while until either we've connected or failed with some kind of error.

There was some discussion about changing the behavior of Join so that it does not try to join in the background, but will just fail after a given timeout, but as @aaronlehmann pointed out, this changes existing behavior.

@cyli cyli force-pushed the bail-on-cert-error-when-joining branch 2 times, most recently from 8884a3c to 96da955 Compare May 30, 2017 22:56
@cyli cyli force-pushed the bail-on-cert-error-when-joining branch from 96da955 to 1eeea15 Compare May 30, 2017 23:09
@codecov
Copy link

codecov bot commented May 30, 2017

Codecov Report

Merging #2206 into master will increase coverage by 0.13%.
The diff coverage is 90.9%.

@@            Coverage Diff             @@
##           master    #2206      +/-   ##
==========================================
+ Coverage   60.03%   60.17%   +0.13%     
==========================================
  Files         124      124              
  Lines       20115    20143      +28     
==========================================
+ Hits        12077    12121      +44     
+ Misses       6680     6661      -19     
- Partials     1358     1361       +3

@cyli cyli changed the title [agent/node] Bail on cert error when joining WIP: [agent/node] Bail on cert error when joining May 30, 2017
@cyli cyli force-pushed the bail-on-cert-error-when-joining branch from 1eeea15 to 5e42e09 Compare May 30, 2017 23:41
@cyli cyli changed the title WIP: [agent/node] Bail on cert error when joining [agent/node] Bail on cert error when joining May 31, 2017
@cyli
Copy link
Contributor Author

cyli commented May 31, 2017

I also want to test that if JoinAddr is not specified, that the agent will poll forever even if there is an x509 error, but I'm not sure how to do that without just making the integration test take a long time. :| Any suggestions would be welcome.

@aaronlehmann
Copy link
Collaborator

An alternative solution is to load the certs and attempt to dial to the JoinAddr first, before even starting up the node, and see if it fails

I believe we pass a connectionbroker.Broker into the agent. Right now we do this to seed the Remotes interface that the broker uses with the address we're joining:

                if n.config.JoinAddr != "" {
                        n.remotes.Observe(api.Peer{Addr: n.config.JoinAddr}, remotes.DefaultObservationWeight)
                }

I had the idea of making an alternate implementation of Broker that preconnects to this address, and is able to observe an x509 failure before starting the agent. However, it wouldn't really make this more efficient, because it wouldn't be able to return that connection to a future caller - callers of Select and SelectRemote can pass in their own connection-level options that we wouldn't know ahead of time.

So scratch that.

I guess the approach taken here makes sense. I'm not crazy about the extra complexity but I can't think of anything better.

I'd just suggest removing On from the method names.

@cyli
Copy link
Contributor Author

cyli commented May 31, 2017

@aaronlehmann Agree - if we could fail on any kind of error on first join, doing the check at the spot you point out I think would be a good idea. I'm also not super happy with how this works but I couldn't think of anything else for now.

Have removed On from the names.

@cyli cyli force-pushed the bail-on-cert-error-when-joining branch from 5e42e09 to 930f38b Compare May 31, 2017 17:25
Copy link
Contributor

@diogomonica diogomonica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with all the above comments. Added complexity isn't good, but this is needed.

agent/config.go Outdated
@@ -43,6 +43,10 @@ type Config struct {

// NodeTLSInfo contains the starting node TLS info to bootstrap into the agent
NodeTLSInfo *api.NodeTLSInfo

// SessionTracker, if provided, will have its OnSessionClosed and OnSessionError methods called
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still has On in the names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

node/node.go Outdated
// because of a TLS error, we don't want to exit the agent because a previously successful
// session indicates that the TLS error may be a transient issue.

// Note: locks are not needed on the assumption that these will only be called within the agent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well add locks and remove the need for this assumption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

node/node.go Outdated
func (fs *firstSessionErrorTracker) SessionClosed() error {
// unfortunately grpc connection errors are type grpc.rpcError, which are not exposed, and we can't get at the underlying error type
if !fs.pastFirstSession && grpc.Code(fs.err) == codes.Internal &&
strings.HasPrefix(grpc.ErrorDesc(fs.err), "connection error") && strings.Contains(grpc.ErrorDesc(fs.err), "transport: x509:") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is terrible. grpc--

…ing,

and registrations to be tracked.

Signed-off-by: Ying Li <[email protected]>
@cyli cyli force-pushed the bail-on-cert-error-when-joining branch from 930f38b to ee0a932 Compare June 2, 2017 23:18
agent/agent.go Outdated
@@ -206,6 +206,11 @@ func (a *Agent) run(ctx context.Context) {
leaving = a.leaving
subscriptions = map[string]context.CancelFunc{}
)
defer func() {
if session != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's possible for session to be nil. In the <-ctx.Done() branch, session.close() is called without this check.

That call should probably be removed now that closing the session is handled in a defer. It's not harmful, but it's redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@cyli cyli force-pushed the bail-on-cert-error-when-joining branch from ee0a932 to f03ac28 Compare June 5, 2017 18:10
agent/agent.go Outdated
@@ -206,6 +206,7 @@ func (a *Agent) run(ctx context.Context) {
leaving = a.leaving
subscriptions = map[string]context.CancelFunc{}
)
defer session.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Careful here. I believe this would call the close method on the original session.

Each time a "defer" statement executes, the function value and parameters to the call are evaluated as usual and saved anew but the actual function is not invoked.

https://golang.org/ref/spec#Defer_statements

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, good point - so wrapping it in a function is still the best idea since that evaluates when the function executes.

… established.

If a session has never been established, and we get an x509 error, the make sure the
agent does not rebuild new sessions and just exists.  Make sure that this error gets
propagated back from the node.

Signed-off-by: Ying Li <[email protected]>
@cyli cyli force-pushed the bail-on-cert-error-when-joining branch from f03ac28 to 36eb01a Compare June 5, 2017 18:27
@cyli
Copy link
Contributor Author

cyli commented Jun 5, 2017

Also, should this be cherry-picked into 17.06?

@aaronlehmann
Copy link
Collaborator

LGTM

Also, should this be cherry-picked into 17.06?

I don't think this problem would come up very often or have a particularly severe effect, but please let me know if I've overlooked something.

@aaronlehmann aaronlehmann merged commit bd6b571 into moby:master Jun 5, 2017
@cyli
Copy link
Contributor Author

cyli commented Jun 5, 2017

@aaronlehmann It probably won't, you're right.

@cyli cyli deleted the bail-on-cert-error-when-joining branch June 5, 2017 18:39
@cyli cyli restored the bail-on-cert-error-when-joining branch June 5, 2017 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants