Inconsistent handling of container failure to start #4294

hickeng · 2017-03-17T21:02:37Z

This was identified during the investigation of #4289.

The containerVM in question is not setting the session.started flag, so the property collector times out. Why the flag is not being set is out of scope for this issue.
The problem is how we handle that failure: we Unbind the network config, but we do not power down the container. If the container process did not start successfully, we are left with a useless containerVM consuming resources. If the cause was a network hiccup, VC queuing, or another control plane issue, then we have just disconnected a functioning containerVM from the network it requires.

Solutions:

1. Do not unbind on the error path after power on has been confirmed
2. Ensure that the VM is powered down if it fails to report status by the deadline

In either case, the Unbind can be triggered by the VM power off instead of explicitly.

Notes:
This impacts cVMs that start after the current 3 minute timeout has expired. The timeout for cVM start was added to address failure scenarios and because docker run -it inherits the awkward blocking behaviour of the standard docker client when attach is used (interception of Ctrl-C, et al), meaning it cannot be easily escaped. The correct solution to this is:

address the problem behaviour in docker client that results in timeouts being used
ensure that cVMs either come up cleanly or report failure and shut themselves down
only unbind network addresses on power off

This likely means changing the power state operations to be async and then waiting on events (either the expected status change or an error). I've updated the estimate to encompass a possible shift to async power operations, but it does not include raising a PR for the signal forwarding behaviour of docker client.


mdubya66 · 2017-10-06T15:23:46Z

putting into 1.3 and marking high pri, per @pdaigle

stuclem · 2018-01-09T19:11:58Z

@hickeng and @mdubya66 I have no idea how to write this up for the release notes. Can you provide some user visible symptoms?

hickeng · 2018-01-16T22:26:36Z

@stuclem A container that times out while starting will result in an error message "context deadline exceeded".

When that occurs the containerVM is not powered off but is left in state "Starting" and may not have a configured network interface. A secondary consequence is that docker-compose and other tools that perform operations based on container state may not handle Starting correctly; in the case of compose it does not stop the container before trying to remove it.

stuclem · 2018-01-17T10:40:49Z

Thanks @hickeng. I added your writeup to the RNs.

sgairo · 2018-03-28T18:54:59Z

This covers handling of the error path only, so lowering to p2. The root cause has been fixed.

anchal-agrawal · 2018-03-28T18:56:35Z

Removed from 1.4 release and OKR - Customer Environments.

yuyangbj · 2019-01-10T07:08:54Z

I think the fix in #8445 implements solution 1: we no longer unbind the container if an error is found later.

yuyangbj · 2019-02-15T06:23:01Z

PR #8445 has already been merged. Closing this issue.

stuclem · 2019-03-05T10:13:52Z

I don't think that this affects the core user doc, but it does need to be marked as a resolved issue in the 1.5.2 release notes.

hickeng added component/portlayer/network kind/defect Behavior that is inconsistent with what's intended labels Mar 17, 2017

hickeng added the priority/p2 label Apr 27, 2017

hickeng added the source/customer Reported by a customer, directly or via an intermediary label Sep 25, 2017

hickeng changed the title ~~Inconsistent handling of failure to container failure to start~~ Inconsistent handling of container failure to start Sep 25, 2017

hickeng added team/container team/foundation and removed team/container labels Sep 25, 2017

mdubya66 added priority/p0 and removed priority/p2 labels Oct 6, 2017

mdubya66 added the impact/doc/note Requires creation of or changes to an official release note label Jan 8, 2018

stuclem removed the impact/doc/note Requires creation of or changes to an official release note label Jan 17, 2018

sgairo added priority/p2 and removed priority/p0 labels Mar 28, 2018

renmaosheng assigned yuyangbj Jan 10, 2019

yuyangbj added this to the Sprint 42 milestone Jan 30, 2019

renmaosheng modified the milestones: Sprint 42, Sprint 44 Feb 11, 2019

yuyangbj closed this as completed Feb 15, 2019

yuyangbj added the impact/doc/user Requires changes to official user documentation label Feb 26, 2019

stuclem added impact/doc/note Requires creation of or changes to an official release note and removed impact/doc/user Requires changes to official user documentation labels Mar 5, 2019
