Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deployment is racing to update replication controller #836

Closed
smarterclayton opened this issue Feb 2, 2015 · 9 comments
Closed

Deployment is racing to update replication controller #836

smarterclayton opened this issue Feb 2, 2015 · 9 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.
Milestone

Comments

@smarterclayton
Copy link
Contributor

https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/772/artifact/origin/artifacts/e2e/logs/container-4b63c64b42ce.log

seeing these compare and swap problems. A, we should be logging a better error (maybe we're not using etcderr.InterpretUpdateError), B, we should figure out what is it racing against, and C, we should retry on race in most cases.

@smarterclayton smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Feb 2, 2015
@smarterclayton
Copy link
Contributor Author

@ironcladlou @abhgupta this is beta1 blocker and need to find someone to fix it

@ironcladlou
Copy link
Contributor

I'll take this.

@ironcladlou
Copy link
Contributor

This is my theory: The DeploymentController updates the ReplicationController (deployment) to keep the deployment status in sync with the deployer Pod associated with the deployment. The deployer logic in the Pod does a GET for the deployment. By the time the deployer code tries to PUT the new replica count, the DeploymentController may have updated the status on the RC annotations (e.g. to move it from Pending->Running), hence the stale resourceVersion used by the deployer code.

cc @pmorie

@smarterclayton
Copy link
Contributor Author

Ok, so the client needs a merge operation that retrieves, checks that someone else hasn't set a size, and converges. We talked about this upstream somewhere about having client merge util, but until then you can emulate it pretty easily.

----- Original Message -----

This is my theory: The DeploymentController updates the ReplicationController
(deployment) to keep the deployment status in sync with the deployer Pod
associated with the deployment. The deployer logic in the Pod does a GET for
the deployment. By the time the deployer code tries to PUT the new replica
count, the DeploymentController may have updated the status on the RC
annotations (e.g. to move it from Pending->Running), hence the stale
resourceVersion used by the deployer code.


Reply to this email directly or view it on GitHub:
#836 (comment)

@ironcladlou
Copy link
Contributor

@pmorie @smarterclayton @pweil-

Some options here.

  • Retry forever with backoff logic to prevent rapid spamming
  • Retry up to a timeout or count, also with backoff
  • Make the retry characteristics configurable as a new params for the recreate strategy

Thoughts?

@smarterclayton
Copy link
Contributor Author

Right now, for beta1, even a few retries with no delay or back off would be better than what we have.

----- Original Message -----

@pmorie @smarterclayton @pweil-

Some options here.

  • Retry forever with backoff logic to prevent rapid spamming
  • Retry up to a timeout or count, also with backoff
  • Make the retry characteristics configurable as a new params for the
    recreate strategy

Thoughts?


Reply to this email directly or view it on GitHub:
#836 (comment)

@pweil-
Copy link
Contributor

pweil- commented Feb 3, 2015

I think a default retry count (maybe 3-5) and a wait between would be ok, with a check each time to see if the replication count has been updated.

This is actually pretty relevant to the changes I'd like to make to the auto-scaler proposal that mention the interactions with a deployer updating a replication count must be able to work in the context of other components updating replica counts. (https://github.com/pweil-/kubernetes/pull/2)

@smarterclayton
Copy link
Contributor Author

In the context here you don't need a wait - in a "race" you need to reread, check what has changed, and either proceed or fail immediately. This is a write conflict, not necessarily a logic conflict.

----- Original Message -----

I think a default retry count (maybe 3-5) and a wait between would be ok,
with a check each time to see if the replication count has been updated.

This is actually pretty relevant to the changes I'd like to make to the
auto-scaler proposal that mention the interactions with a deployer updating
a replication count must be able to work in the context of other components
updating replica counts. (https://github.com/pweil-/kubernetes/pull/2)


Reply to this email directly or view it on GitHub:
#836 (comment)

ironcladlou added a commit to ironcladlou/origin that referenced this issue Feb 3, 2015
Support retries in the Recreate deployment strategy. This is a simple
fix for update race conditions as described in openshift#836.
ironcladlou added a commit to ironcladlou/origin that referenced this issue Feb 3, 2015
Support retries in the Recreate deployment strategy. This is a simple
fix for update race conditions as described in openshift#836.
@smarterclayton smarterclayton modified the milestone: 0.3.0 (beta1) Feb 3, 2015
@ironcladlou
Copy link
Contributor

Resolved by #846

jpeeler pushed a commit to jpeeler/origin that referenced this issue Jun 15, 2017
…service-catalog/' changes from c91fecb..1bfff53

1bfff53 instance never provisioned should just delete (openshift#891)
1ae26db Adding a fake broker server (openshift#928)
6403076 docs: fix quoting issue, clarify naming in auth.md (openshift#931)
8ac0775 Merge branch 'pr/927'
02af952 Merge branch 'pr/876'
2aa84f9 add Jenkins badge to README
0c08788 Brokers must have at least one service (openshift#930)
cbfa39b Add PodPreset support (openshift#917)
0d9b810 refactor Jenkins GitHub status postback to work on non-PR commits (openshift#916)
066159d Converting the AuthSecret field to a union AuthInfo type (openshift#877)
203af5c Add leader election namespace configuration (openshift#920)
5831502 Add example JSON schema to controller unit tests (openshift#918)
b78ab99 Fix usage of finalizers (openshift#894)
d3d29f0 Enable pprof in controller-manager (openshift#896)
f4233a0 Correct parameter schema support (openshift#912)
05c6f00 bump image tags from v0.0.8 to v0.0.9 (openshift#910)
97d278a Add support for OSB parameter schemas (openshift#822)
3e4120e Fix nil dereference panic on request timeout (openshift#906)
d8c7494 Add feature gate for audit options in helm chart (openshift#904)
89ce1cd Decompose controller unit tests (openshift#899)
a1e83b2 Add e2e for walkthrough (openshift#832)
4679685 Add support for audit log options (openshift#897)
262a94f Do not allow updates to an object if asynchronous operation is in progress (openshift#853)
7295dad Validate that a ServiceClass must have at least one plan (openshift#879)
9db9fa4 Decompose controller.go (openshift#893)
c3ea9bd Nits in our types (openshift#854)
1d8280a bump tags from v0.0.7 to v0.0.8 (openshift#892)
5e6925d Clean up the OSB client (openshift#888)
fe6aee9 cleaning up logs and adding more log detail (openshift#874)
f41516f Detect if a TPR update represents a soft delete (openshift#836)
9ce99f3 Add functions on Makefile for build and tag
REVERT: c91fecb Merge pull request openshift#1 from jpeeler/origin-build
REVERT: 55ccf3d origin build: add _output to .gitignore
REVERT: 8352e14 origin build: make build-go and build-cross work
REVERT: d969641 origin build: modify hard coded path
REVERT: 30000cc origin build: add origin tooling

git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog
git-subtree-split: 1bfff53
jpeeler pushed a commit to jpeeler/origin that referenced this issue Jun 15, 2017
…service-catalog/' changes from c91fecb..568a7b9

568a7b9 origin build: add origin tooling
1bfff53 instance never provisioned should just delete (openshift#891)
1ae26db Adding a fake broker server (openshift#928)
6403076 docs: fix quoting issue, clarify naming in auth.md (openshift#931)
8ac0775 Merge branch 'pr/927'
02af952 Merge branch 'pr/876'
2aa84f9 add Jenkins badge to README
0c08788 Brokers must have at least one service (openshift#930)
cbfa39b Add PodPreset support (openshift#917)
0d9b810 refactor Jenkins GitHub status postback to work on non-PR commits (openshift#916)
066159d Converting the AuthSecret field to a union AuthInfo type (openshift#877)
203af5c Add leader election namespace configuration (openshift#920)
5831502 Add example JSON schema to controller unit tests (openshift#918)
b78ab99 Fix usage of finalizers (openshift#894)
d3d29f0 Enable pprof in controller-manager (openshift#896)
f4233a0 Correct parameter schema support (openshift#912)
05c6f00 bump image tags from v0.0.8 to v0.0.9 (openshift#910)
97d278a Add support for OSB parameter schemas (openshift#822)
3e4120e Fix nil dereference panic on request timeout (openshift#906)
d8c7494 Add feature gate for audit options in helm chart (openshift#904)
89ce1cd Decompose controller unit tests (openshift#899)
a1e83b2 Add e2e for walkthrough (openshift#832)
4679685 Add support for audit log options (openshift#897)
262a94f Do not allow updates to an object if asynchronous operation is in progress (openshift#853)
7295dad Validate that a ServiceClass must have at least one plan (openshift#879)
9db9fa4 Decompose controller.go (openshift#893)
c3ea9bd Nits in our types (openshift#854)
1d8280a bump tags from v0.0.7 to v0.0.8 (openshift#892)
5e6925d Clean up the OSB client (openshift#888)
fe6aee9 cleaning up logs and adding more log detail (openshift#874)
f41516f Detect if a TPR update represents a soft delete (openshift#836)
9ce99f3 Add functions on Makefile for build and tag
REVERT: c91fecb Merge pull request openshift#1 from jpeeler/origin-build
REVERT: 55ccf3d origin build: add _output to .gitignore
REVERT: 8352e14 origin build: make build-go and build-cross work
REVERT: d969641 origin build: modify hard coded path
REVERT: 30000cc origin build: add origin tooling

git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog
git-subtree-split: 568a7b9dbdc4fdd1fabffdd52af030ec73124b89
jpeeler pushed a commit to jpeeler/origin that referenced this issue Feb 1, 2018
* detect if a TPR update represents a soft delete

if it does, then intercept the PUT, remove the deletion timestamp and
deletion grace period, and add a finalizer

this commit also splits some of the CRUD actions into common functions,
so they can be reused

* adding more finalizer-related metadata funcs

* modify the delete function to remove finalizers

* modify a delete test to check for finalizers

* fixing final test for deletion

* adding tests for the metadata accessors

* modifying the update to check for deletion flags earlier

also modifying the delete func to accept a code

* fixing format string directive

* change the fake REST client to return the expected HTTP response code

deleting other resources in Kubernetes shows that the endpoint returns
a 200. For example, here’s deleting a namespace:

```console
ENG000656:service-catalog aaronschlesinger$ k delete ns catalog -v 10
I0511 16:43:04.375166   54273 loader.go:354] Config loaded from file
/Users/aaronschlesinger/.kube/config
I0511 16:43:04.379679   54273 cached_discovery.go:71] returning cached
discovery info from
/Users/aaronschlesinger/.kube/cache/discovery/192.168.99.100_8443/autosc
aling/v1/serverresources.json
<snip>
I0511 16:43:04.383012   54273 round_trippers.go:398] curl -k -v
-XDELETE  -H "Accept: application/json" -H "User-Agent: kubectl/v0.0.0
(darwin/amd64) kubernetes/$Format"
https://192.168.99.100:8443/api/v1/namespaces/catalog
I0511 16:43:04.412668   54273 round_trippers.go:417] DELETE
https://192.168.99.100:8443/api/v1/namespaces/catalog 200 OK in 29
milliseconds
```

* Moving metadata functions to a new package

and using the new package everywhere

* fixing typo in comment

* improving error messages from binding tests

* updating the rest client to soft-delete

* adding more functions to the deletion metadata suite

* refactoring the storage interface's update func

so that it does complete updates for soft deletion, and has better docs

* returning the appropriate content types from the fake server

* fetching codecs in a type-agnostic way

* ensuring that deletion timestamp can't be changed

instead of preventing it from being passed at all

* returning errors properly in the rest client

* returning errors properly in the storage interface

* renaming headers util file

* fixing deletion timestamp deletion timestamp equality checks

* fixing a unit test

* adding the option to make the TPR storage interface hard-delete

this is necessary for service class resources

* adding missing godoc

* fixing test compile errors

* removing test regression

and commenting

* fixing format string value

* remove incorrect, early return from the update func

* splitting functions and test into more logical files

* implementing remaining TODO tests

* fix compile error

the generated code has changed, and this field name has also changed

* adding docs for the singular and list funcs

* removing unused arg

* fixing test compile errs

* Adding more godoc to the RemoveFinalizer func
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

3 participants