Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent handling of container failure to start #4294

Closed
hickeng opened this issue Mar 17, 2017 · 9 comments
Closed

Inconsistent handling of container failure to start #4294

hickeng opened this issue Mar 17, 2017 · 9 comments
Assignees
Labels
component/portlayer/network impact/doc/note Requires creation of or changes to an official release note kind/defect Behavior that is inconsistent with what's intended priority/p2 source/customer Reported by a customer, directly or via an intermediary team/foundation
Milestone

Comments

@hickeng
Copy link
Member

hickeng commented Mar 17, 2017

This is identified during investigation of #4289

The containerVM in question is not setting the session.started flag and as such the property collector times out. The reason why it's not being set is unimportant in this issue.
The problem is how we handle that failure - we Unbind the network config, but we do not power down the container. This means that if the container process didn't start successfully we have a useless containerVM consuming resource. If it was a network hicup/VC queuing/other control plane issue, then we've just disconnected a functioning containerVM from the network it required.

Solutions:

  1. Do not unbind on the error path after power on has been confirmed
  2. Ensure that the VM is powered down if it fails to report status by the deadline

In either case, the Unbind can be triggered by the VM power off instead of explciitly.

Notes:
This impacts cVMs that start after the current 3min timeout has expired. The timeout for cVM start was added to address failure scenarios and because docker run -it inherits the awkward blocking behaviour of the standard docker client when attach is used (interception of Ctrl-C, et al) meaning it cannot be easily escaped. The correct solution to this is:

  1. address the problem behaviour in docker client that results in timeouts being used
  2. ensure that cVMs either come up cleanly or report failure and shut themselves down
  3. only unbind network addresses on power off

This likely means changing the power state operations to be async and then waiting on events (either the expected status change or an error). I've upated the estimate to encompass a possible shift to async power operations but doesn't not include raising a PR for the signal forwarding behaviour of docker client.

@hickeng hickeng added component/portlayer/network kind/defect Behavior that is inconsistent with what's intended labels Mar 17, 2017
@hickeng hickeng added the source/customer Reported by a customer, directly or via an intermediary label Sep 25, 2017
@hickeng hickeng changed the title Inconsistent handling of failure to container failure to start Inconsistent handling of container failure to start Sep 25, 2017
@mdubya66
Copy link
Contributor

mdubya66 commented Oct 6, 2017

putting into 1.3 and marking high pri, per @pdaigle

@mdubya66 mdubya66 added the impact/doc/note Requires creation of or changes to an official release note label Jan 8, 2018
@stuclem
Copy link
Contributor

stuclem commented Jan 9, 2018

@hickeng and @mdubya66 I have no idea how to write this up for the release notes. Can you provide some user visible symptoms?

@hickeng
Copy link
Member Author

hickeng commented Jan 16, 2018

@stuclem A container that times out while starting will result in an error message "context deadline exceeded".

When that occurs the containerVM is not powered off but is left in state "Starting" and may not have a configured network interface. A secondary consequence is that docker-compose and other tools that perform operations based on container state may not handle Starting correctly; in the case of compose it does not stop the container before trying to remove it.

@stuclem
Copy link
Contributor

stuclem commented Jan 17, 2018

Thanks @hickeng. I added your writeup to the RNs.

@stuclem stuclem removed the impact/doc/note Requires creation of or changes to an official release note label Jan 17, 2018
@sgairo
Copy link
Contributor

sgairo commented Mar 28, 2018

This is for handling error path, lowering to p2. Root cause has been fixed.

@anchal-agrawal
Copy link
Contributor

Removed from 1.4 release and OKR - Customer Environments.

@yuyangbj
Copy link
Contributor

I think the fix #8445 supports solution 1. We will not unbind container if we found error later.

@yuyangbj yuyangbj added this to the Sprint 42 milestone Jan 30, 2019
@renmaosheng renmaosheng modified the milestones: Sprint 42, Sprint 44 Feb 11, 2019
@yuyangbj
Copy link
Contributor

PR #8445 has already been merged. Closing it.

@yuyangbj yuyangbj added the impact/doc/user Requires changes to official user documentation label Feb 26, 2019
@stuclem stuclem added impact/doc/note Requires creation of or changes to an official release note and removed impact/doc/user Requires changes to official user documentation labels Mar 5, 2019
@stuclem
Copy link
Contributor

stuclem commented Mar 5, 2019

I don't think that this affects the core user doc, but it does need to be marked as a resolved issue in the 1.5.2 release notes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/portlayer/network impact/doc/note Requires creation of or changes to an official release note kind/defect Behavior that is inconsistent with what's intended priority/p2 source/customer Reported by a customer, directly or via an intermediary team/foundation
Projects
None yet
Development

No branches or pull requests

7 participants