openshift-api issues breaking cluster #21612
Comments
When your server comes back up (it will crash and try to recover), quickly run
and you should end up with a stable cluster, just one without any monitoring.
We can bump domain memory for masters to 4Gi. @openshift/sig-cloud
Currently only the installer provisions masters (because once the cluster is running you'd need manual intervention to attach new etcd nodes). And installer-launched masters got bumped to 4GB in openshift/installer#785 (just landed).
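For a libvirt cluster that is already running, one rough way to check and bump a master domain's memory by hand is sketched below; the domain name master-0 is an assumption, sizes are in KiB, and the change only takes effect after the domain restarts:
$ virsh dominfo master-0                         # check current Max/Used memory
$ virsh setmaxmem master-0 4194304 --config      # 4 GiB, applied on next boot
$ virsh setmem master-0 4194304 --config
$ virsh shutdown master-0 && virsh start master-0
Rebooting a master this way will itself cause the kind of API blips discussed here, so it is only worth doing on a throwaway cluster.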
Should be resolved now.
Reopening because I'm still seeing these errors and TCP timeouts.
The current API instability may be a symptom of some underlying master instability. I don't know what's going on yet, but in a recent CI run there was a running die-off of pods before the machine-config daemon pulled the plug and rebooted the node. Notes on the MCD part are in openshift/machine-config-operator#224; notes on etcd-member (the first pod to die) are in openshift/installer#844. I don't know what's going on there, but I can certainly see occasional master reboots causing connectivity issues like these.
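If anyone wants to poke at the same symptom, a rough way to pull the relevant logs is sketched below; the namespaces and pod names are assumptions and vary by payload:
$ oc get pods --all-namespaces -o wide | grep -E 'etcd-member|machine-config-daemon'
$ oc logs -n openshift-machine-config-operator <machine-config-daemon-pod>   # why the MCD decided to reboot
$ oc logs -n kube-system <etcd-member-pod>                                   # the first pod that died in the CI run
If the API is unreachable, the same information is available on the node itself via journalctl and crictl logs.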
I think this might have been resolved by openshift/machine-config-operator#225. Can anyone still reproduce? If not, can we close this?
I still see a similar error in origin-template-service-broker:
E0104 16:14:25.634855 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0104 16:15:56.017707 1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
On the command line, I get a 503 error:
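That "server is currently unable to handle the request" message is what clients get when an aggregated API service is marked unavailable, so a quick check is to look at the apiservice and its backing pods. A sketch, assuming the service catalog runs in its usual kube-service-catalog namespace:
$ oc get apiservice v1beta1.servicecatalog.k8s.io        # AVAILABLE should be True
$ oc describe apiservice v1beta1.servicecatalog.k8s.io   # conditions explain why it is unavailable
$ oc get pods -n kube-service-catalog                    # the apiserver pods backing that group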
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issue still present with v3.11 on openSUSE Tumbleweed.
There are currently openshift-api issues that break the cluster and leave it unusable.
Steps To Reproduce
Current Result
openshift-apiserver is down with various connection refused errors:
E1204 19:21:10.940510 1 memcache.go:147] couldn't get resource list for samplesoperator.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/samplesoperator.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.948714 1 memcache.go:147] couldn't get resource list for servicecertsigner.config.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/servicecertsigner.config.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:10.953465 1 memcache.go:147] couldn't get resource list for tuned.openshift.io/v1alpha1: Get https://172.30.0.1:443/apis/tuned.openshift.io/v1alpha1?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.805850 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.RoleBinding: Get https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818657 1 reflector.go:136] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:129: Failed to list *core.LimitRange: Get https://172.30.0.1:443/api/v1/limitranges?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
E1204 19:21:11.818742 1 reflector.go:136] k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: connection refused
Cannot SSH into the master; TCP/connection errors throughout the pods; for a time, oc and kubectl commands also stop working. Eventually most of the pods end up in CrashLoopBackOff.
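All of the dial errors above point at 172.30.0.1:443, the in-cluster kubernetes service, so the first things to check are the masters and the API endpoints behind that service. A rough sequence once oc answers again (a sketch; it assumes an admin kubeconfig):
$ oc get nodes                                               # a NotReady master would explain the refused connections
$ oc get endpoints kubernetes -n default                     # should list the master API server addresses
$ oc get pods --all-namespaces -o wide | grep -i apiserver   # apiserver pods stuck in CrashLoopBackOff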
Expected Result
I don't expect any errors.
Additional Information
[try to run the $ oc adm diagnostics (or oadm diagnostics) command if possible]
[if you are reporting an issue related to builds, provide build logs with BUILD_LOGLEVEL=5]
[consider attaching output of the $ oc get all -o json -n <namespace> command to the issue]
[visit https://docs.openshift.org/latest/welcome/index.html]