-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deployment is racing to update replication controller #836
Comments
@ironcladlou @abhgupta this is beta1 blocker and need to find someone to fix it |
I'll take this. |
This is my theory: The DeploymentController updates the ReplicationController (deployment) to keep the deployment status in sync with the deployer Pod associated with the deployment. The deployer logic in the Pod does a GET for the deployment. By the time the deployer code tries to PUT the new replica count, the DeploymentController may have updated the status on the RC annotations (e.g. to move it from Pending->Running), hence the stale resourceVersion used by the deployer code. cc @pmorie |
Ok, so the client needs a merge operation that retrieves, checks that someone else hasn't set a size, and converges. We talked about this upstream somewhere about having client merge util, but until then you can emulate it pretty easily. ----- Original Message -----
|
@pmorie @smarterclayton @pweil- Some options here.
Thoughts? |
Right now, for beta1, even a few retries with no delay or back off would be better than what we have. ----- Original Message -----
|
I think a default retry count (maybe 3-5) and a wait between would be ok, with a check each time to see if the replication count has been updated. This is actually pretty relevant to the changes I'd like to make to the auto-scaler proposal that mention the interactions with a deployer updating a replication count must be able to work in the context of other components updating replica counts. (https://github.com/pweil-/kubernetes/pull/2) |
In the context here you don't need a wait - in a "race" you need to reread, check what has changed, and either proceed or fail immediately. This is a write conflict, not necessarily a logic conflict. ----- Original Message -----
|
Support retries in the Recreate deployment strategy. This is a simple fix for update race conditions as described in openshift#836.
Support retries in the Recreate deployment strategy. This is a simple fix for update race conditions as described in openshift#836.
Resolved by #846 |
…service-catalog/' changes from c91fecb..1bfff53 1bfff53 instance never provisioned should just delete (openshift#891) 1ae26db Adding a fake broker server (openshift#928) 6403076 docs: fix quoting issue, clarify naming in auth.md (openshift#931) 8ac0775 Merge branch 'pr/927' 02af952 Merge branch 'pr/876' 2aa84f9 add Jenkins badge to README 0c08788 Brokers must have at least one service (openshift#930) cbfa39b Add PodPreset support (openshift#917) 0d9b810 refactor Jenkins GitHub status postback to work on non-PR commits (openshift#916) 066159d Converting the AuthSecret field to a union AuthInfo type (openshift#877) 203af5c Add leader election namespace configuration (openshift#920) 5831502 Add example JSON schema to controller unit tests (openshift#918) b78ab99 Fix usage of finalizers (openshift#894) d3d29f0 Enable pprof in controller-manager (openshift#896) f4233a0 Correct parameter schema support (openshift#912) 05c6f00 bump image tags from v0.0.8 to v0.0.9 (openshift#910) 97d278a Add support for OSB parameter schemas (openshift#822) 3e4120e Fix nil dereference panic on request timeout (openshift#906) d8c7494 Add feature gate for audit options in helm chart (openshift#904) 89ce1cd Decompose controller unit tests (openshift#899) a1e83b2 Add e2e for walkthrough (openshift#832) 4679685 Add support for audit log options (openshift#897) 262a94f Do not allow updates to an object if asynchronous operation is in progress (openshift#853) 7295dad Validate that a ServiceClass must have at least one plan (openshift#879) 9db9fa4 Decompose controller.go (openshift#893) c3ea9bd Nits in our types (openshift#854) 1d8280a bump tags from v0.0.7 to v0.0.8 (openshift#892) 5e6925d Clean up the OSB client (openshift#888) fe6aee9 cleaning up logs and adding more log detail (openshift#874) f41516f Detect if a TPR update represents a soft delete (openshift#836) 9ce99f3 Add functions on Makefile for build and tag REVERT: c91fecb Merge pull request openshift#1 from jpeeler/origin-build REVERT: 55ccf3d origin build: add _output to .gitignore REVERT: 8352e14 origin build: make build-go and build-cross work REVERT: d969641 origin build: modify hard coded path REVERT: 30000cc origin build: add origin tooling git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog git-subtree-split: 1bfff53
…service-catalog/' changes from c91fecb..568a7b9 568a7b9 origin build: add origin tooling 1bfff53 instance never provisioned should just delete (openshift#891) 1ae26db Adding a fake broker server (openshift#928) 6403076 docs: fix quoting issue, clarify naming in auth.md (openshift#931) 8ac0775 Merge branch 'pr/927' 02af952 Merge branch 'pr/876' 2aa84f9 add Jenkins badge to README 0c08788 Brokers must have at least one service (openshift#930) cbfa39b Add PodPreset support (openshift#917) 0d9b810 refactor Jenkins GitHub status postback to work on non-PR commits (openshift#916) 066159d Converting the AuthSecret field to a union AuthInfo type (openshift#877) 203af5c Add leader election namespace configuration (openshift#920) 5831502 Add example JSON schema to controller unit tests (openshift#918) b78ab99 Fix usage of finalizers (openshift#894) d3d29f0 Enable pprof in controller-manager (openshift#896) f4233a0 Correct parameter schema support (openshift#912) 05c6f00 bump image tags from v0.0.8 to v0.0.9 (openshift#910) 97d278a Add support for OSB parameter schemas (openshift#822) 3e4120e Fix nil dereference panic on request timeout (openshift#906) d8c7494 Add feature gate for audit options in helm chart (openshift#904) 89ce1cd Decompose controller unit tests (openshift#899) a1e83b2 Add e2e for walkthrough (openshift#832) 4679685 Add support for audit log options (openshift#897) 262a94f Do not allow updates to an object if asynchronous operation is in progress (openshift#853) 7295dad Validate that a ServiceClass must have at least one plan (openshift#879) 9db9fa4 Decompose controller.go (openshift#893) c3ea9bd Nits in our types (openshift#854) 1d8280a bump tags from v0.0.7 to v0.0.8 (openshift#892) 5e6925d Clean up the OSB client (openshift#888) fe6aee9 cleaning up logs and adding more log detail (openshift#874) f41516f Detect if a TPR update represents a soft delete (openshift#836) 9ce99f3 Add functions on Makefile for build and tag REVERT: c91fecb Merge pull request openshift#1 from jpeeler/origin-build REVERT: 55ccf3d origin build: add _output to .gitignore REVERT: 8352e14 origin build: make build-go and build-cross work REVERT: d969641 origin build: modify hard coded path REVERT: 30000cc origin build: add origin tooling git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog git-subtree-split: 568a7b9dbdc4fdd1fabffdd52af030ec73124b89
* detect if a TPR update represents a soft delete if it does, then intercept the PUT, remove the deletion timestamp and deletion grace period, and add a finalizer this commit also splits some of the CRUD actions into common functions, so they can be reused * adding more finalizer-related metadata funcs * modify the delete function to remove finalizers * modify a delete test to check for finalizers * fixing final test for deletion * adding tests for the metadata accessors * modifying the update to check for deletion flags earlier also modifying the delete func to accept a code * fixing format string directive * change the fake REST client to return the expected HTTP response code deleting other resources in Kubernetes shows that the endpoint returns a 200. For example, here’s deleting a namespace: ```console ENG000656:service-catalog aaronschlesinger$ k delete ns catalog -v 10 I0511 16:43:04.375166 54273 loader.go:354] Config loaded from file /Users/aaronschlesinger/.kube/config I0511 16:43:04.379679 54273 cached_discovery.go:71] returning cached discovery info from /Users/aaronschlesinger/.kube/cache/discovery/192.168.99.100_8443/autosc aling/v1/serverresources.json <snip> I0511 16:43:04.383012 54273 round_trippers.go:398] curl -k -v -XDELETE -H "Accept: application/json" -H "User-Agent: kubectl/v0.0.0 (darwin/amd64) kubernetes/$Format" https://192.168.99.100:8443/api/v1/namespaces/catalog I0511 16:43:04.412668 54273 round_trippers.go:417] DELETE https://192.168.99.100:8443/api/v1/namespaces/catalog 200 OK in 29 milliseconds ``` * Moving metadata functions to a new package and using the new package everywhere * fixing typo in comment * improving error messages from binding tests * updating the rest client to soft-delete * adding more functions to the deletion metadata suite * refactoring the storage interface's update func so that it does complete updates for soft deletion, and has better docs * returning the appropriate content types from the fake server * fetching codecs in a type-agnostic way * ensuring that deletion timestamp can't be changed instead of preventing it from being passed at all * returning errors properly in the rest client * returning errors properly in the storage interface * renaming headers util file * fixing deletion timestamp deletion timestamp equality checks * fixing a unit test * adding the option to make the TPR storage interface hard-delete this is necessary for service class resources * adding missing godoc * fixing test compile errors * removing test regression and commenting * fixing format string value * remove incorrect, early return from the update func * splitting functions and test into more logical files * implementing remaining TODO tests * fix compile error the generated code has changed, and this field name has also changed * adding docs for the singular and list funcs * removing unused arg * fixing test compile errs * Adding more godoc to the RemoveFinalizer func
https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/772/artifact/origin/artifacts/e2e/logs/container-4b63c64b42ce.log
seeing these compare and swap problems. A, we should be logging a better error (maybe we're not using etcderr.InterpretUpdateError), B, we should figure out what is it racing against, and C, we should retry on race in most cases.
The text was updated successfully, but these errors were encountered: