GCE cluster provision failure #14575
Comments
@smarterclayton do we have a SOP for these? |
As in, are we creating tickets somewhere for the GCE API? |
We would create tickets under our account, yes. And you need to have enough info that they can reproduce. If you can't reproduce, they basically don't help much. |
Can't really prioritize this one |
Adding p2 then since no priority will show up on the bug list to make sure it's triaged. Feel free to lower it further |
There's clearly something wrong with node setup in the installer, this is p0 until we know why. It looks similar to other racy startup failures, and it happens after networking is configured. |
@smarterclayton we have a bz that may or may not be similar but the cluster recovers: |
Yeah, suspected something similar. I have not been able to manually
reproduce it.
…On Wed, Jul 5, 2017 at 3:06 PM, Derek Carr ***@***.***> wrote:
@smarterclayton <https://github.com/smarterclayton> we have a bz that may
or may not be similar but the cluster recovers:
https://bugzilla.redhat.com/show_bug.cgi?id=1466732
|
Another gce provisioning flake in the installer: https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1262/consoleFull#68717269158b6e51eb7608a5981914356
|
Ugh. @sdodson, who else knows gce.py?
|
another:
|
(that one appears to be a quota exceeded issue) |
Fuck
…On Wed, Jul 12, 2017 at 1:13 PM, Ben Parees ***@***.***> wrote:
another:
TASK [provision : Provision GCE resources] *************************************
Wednesday 12 July 2017 15:46:04 +0000 (0:00:00.111) 0:00:02.965 ********
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/tmp/provision.sh"], "delta": "0:00:41.080842", "end": "2017-07-12 15:46:45.500767", "failed": true, "rc": 1, "start": "2017-07-12 15:46:04.419925", "stderr": "ERROR: (gcloud.compute.networks.create) Could not fetch resource:\n - Quota 'ROUTES' exceeded. Limit: 300.0\n - Quota 'SUBNETWORKS' exceeded. Limit: 275.0", "stderr_lines": ["ERROR: (gcloud.compute.networks.create) Could not fetch resource:", " - Quota 'ROUTES' exceeded. Limit: 300.0", " - Quota 'SUBNETWORKS' exceeded. Limit: 275.0"], "stdout": "", "stdout_lines": []}
PLAY RECAP *********************************************************************
localhost : ok=4 changed=3 unreachable=0 failed=1
Wednesday 12 July 2017 15:46:45 +0000 (0:00:41.202) 0:00:44.167 ********
===============================================================================
provision : Provision GCE resources ------------------------------------ 41.20s
provision : Provision GCE DNS domain ------------------------------------ 1.64s
provision : Templatize DNS script --------------------------------------- 0.69s
provision : Templatize provision script --------------------------------- 0.41s
provision : Ensure that DNS resolves to the hosted zone ----------------- 0.11s
Failure summary:
1. Host: localhost
Play: Ensure all cloud resources necessary for the cluster, including instances, have been started
Task: provision : Provision GCE resources
Message: ???
https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/4241/console
|
Working on a process to ensure cleanup in GCE so we don't hit quota. |
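For triage, the quota failures quoted earlier in this thread can be told apart from other provisioning flakes mechanically. The following is a minimal sketch, assuming the `stderr_lines` format shown in the Ansible failure output above; the function name `exceeded_quotas` is hypothetical and is not part of origin-gce or the CI jobs.

```python
import re

# Matches lines such as: " - Quota 'ROUTES' exceeded. Limit: 300.0"
QUOTA_RE = re.compile(
    r"Quota '(?P<name>[A-Z_]+)' exceeded\.\s+Limit: (?P<limit>[0-9.]+)"
)


def exceeded_quotas(stderr_lines):
    """Return {quota_name: limit} for every quota error found in the lines."""
    found = {}
    for line in stderr_lines:
        match = QUOTA_RE.search(line)
        if match:
            found[match.group("name")] = float(match.group("limit"))
    return found


# Sample input copied from the failure reported in this thread.
stderr_lines = [
    "ERROR: (gcloud.compute.networks.create) Could not fetch resource:",
    " - Quota 'ROUTES' exceeded. Limit: 300.0",
    " - Quota 'SUBNETWORKS' exceeded. Limit: 275.0",
]
print(exceeded_quotas(stderr_lines))
```

An empty result would suggest the failure is something other than quota exhaustion, so leaked resources are probably not the cause.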
We have node logs now. What do they say?
On Jul 13, 2017, at 10:17 PM, Ben Parees <[email protected]> wrote:
next up:
Failure summary:
1. Host: ci-prtest-5a37c28-4303-ig-m-rvst
Play: Configure nodes
Task: openshift_node : restart node
Message: Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error code. See "systemctl status origin-node.service" and "journalctl
-xe" for details.
2. Host: ci-prtest-5a37c28-4303-ig-n-x2bq
Play: Configure nodes
Task: openshift_node : restart node
Message: Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error code. See "systemctl status origin-node.service" and "journalctl
-xe" for details.
3. Host: ci-prtest-5a37c28-4303-ig-n-qs80
Play: Configure nodes
Task: openshift_node : restart node
Message: Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error code. See "systemctl status origin-node.service" and "journalctl
-xe" for details.
4. Host: ci-prtest-5a37c28-4303-ig-n-hj1f
Play: Configure nodes
Task: openshift_node : restart node
Message: Unable to restart service origin-node: Job for
origin-node.service failed because the control process exited with
error code. See "systemctl status origin-node.service" and "journalctl
-xe" for details.
https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/4303
|
if they're gathered in s3 artifacts i don't see them. only journal item is docker.service |
Yeah, we need those. Really we need the full binary journal, no sense in gathering only specific journals when the ones we want are probably 90% of the total log volume anyway. I'll look at where that work got left off. |
It looks like we made sure the artifact-gathering step itself could not fail, but a single failure inside the step could still skip the rest of it, so few or no artifacts were gathered. I've fixed that in openshift-eng/aos-cd-jobs#433 |
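To illustrate the failure mode described here, this is a rough sketch of a gathering loop in which each command runs independently, so that one failure cannot prevent the remaining artifacts from being collected. It is not the change made in openshift-eng/aos-cd-jobs#433; the function name `gather_artifacts` and the demo commands are hypothetical stand-ins for the real gathering commands.

```python
import subprocess
import sys


def gather_artifacts(commands):
    """Run every command; record failures instead of stopping at the first."""
    failures = []
    for name, argv in commands:
        try:
            result = subprocess.run(
                argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE
            )
            if result.returncode != 0:
                failures.append((name, result.returncode))
        except OSError as exc:
            # The binary is missing or not executable; keep going anyway.
            failures.append((name, str(exc)))
    return failures


# Hypothetical stand-ins for real gathering commands.
demo = [
    ("fails", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    ("succeeds", [sys.executable, "-c", "print('collected')"]),
    ("missing-binary", ["/nonexistent/gather-tool"]),
]
print(gather_artifacts(demo))
```

The second command still runs even though the first one exits non-zero, and the caller gets a list of what failed rather than a silently truncated artifact set.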
For #15348 we may need apache/libcloud@03df6a8 |
Can you pick that into origin-gce?
|
So what's in origin-gce is just gce.py which is the ansible inventory script which relies on libcloud. Which version of python2-libcloud is installed on the host? |
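One way to answer the version question from the same interpreter that runs gce.py is to ask libcloud directly. This is a small sketch, assuming the package exposes `libcloud.__version__`; the helper name `libcloud_version` is hypothetical, and it returns `None` if libcloud is not importable.

```python
def libcloud_version():
    """Return the installed libcloud version string, or None if unavailable."""
    try:
        import libcloud
    except ImportError:
        return None
    return getattr(libcloud, "__version__", None)


print(libcloud_version())
```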
Never mind, the latest version of the package doesn't have that fix. |
@smarterclayton @enj do we have resolution for this |
@stevekuznetsov that fix is tagged with |
Just hit this; not identical, but it seems closely related:
|
Opened a ticket with GCE support.
…On Wed, Aug 23, 2017 at 1:28 PM, Simo Sorce ***@***.***> wrote:
Just hit this; not identical, but it seems closely related:
"ERROR: (gcloud.compute.instances.attach-disk) Could not fetch resource:", " - The resource 'projects/openshift-gce-devel-ci/zones/us-central1-a/instances/ci-prtest-5a37c28-6439-ig-n-38r7' was not found"], "stdout": "NAME MODE IPV4_RANGE GATEWAY_IPV4\nci-prtest-5a37c28-6439-ocp-network auto\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-master-external ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 tcp:80,tcp:443,tcp:1936,tcp:8080,tcp:8443,tcp:30000-32000,udp:30000-32000 ocp-master\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-icmp ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 icmp\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-ssh-internal ci-prtest-5a37c28-6439-ocp-network tcp:22 bastion\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-infra-node-internal ci-prtest-5a37c28-6439-ocp-network tcp:5000 ocp ocp-infra-node\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-master-internal ci-prtest-5a37c28-6439-ocp-network tcp:2224,tcp:2379,tcp:2380,tcp:4001,udp:4789,udp:5404,udp:5405,tcp:8053,udp:8053,tcp:8444,tcp:10250,tcp:10255,udp:10255,tcp:24224,udp:24224 ocp ocp-master\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-node-internal ci-prtest-5a37c28-6439-ocp-network udp:4789,tcp:10250,tcp:10255,udp:10255 ocp ocp-node,ocp-infra-node\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-infra-node-external ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 tcp:80,tcp:443,tcp:1936,tcp:30000-32000,udp:30000-32000 ocp-infra-node\nNAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS\nci-prtest-5a37c28-6439-ssh-external ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 tcp:22\nNAME MACHINE_TYPE PREEMPTIBLE CREATION_TIMESTAMP\nci-prtest-5a37c28-6439-instance-template-master n1-standard-2 2017-08-23T09:58:48.326-07:00\nNAME MACHINE_TYPE PREEMPTIBLE 
CREATION_TIMESTAMP\nci-prtest-5a37c28-6439-instance-template-node n1-standard-2 2017-08-23T09:58:48.262-07:00\nNAME MACHINE_TYPE PREEMPTIBLE CREATION_TIMESTAMP\nci-prtest-5a37c28-6439-instance-template-node-infra n1-standard-2 2017-08-23T09:58:48.210-07:00\nNAME LOCATION SCOPE BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE AUTOSCALED\nci-prtest-5a37c28-6439-ig-n us-central1-a zone ci-prtest-5a37c28-6439-ig-n 0 3 ci-prtest-5a37c28-6439-instance-template-node no\nNAME LOCATION SCOPE BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE AUTOSCALED\nci-prtest-5a37c28-6439-ig-m us-central1-a zone ci-prtest-5a37c28-6439-ig-m 0 1 ci-prtest-5a37c28-6439-instance-template-master no\nNAME LOCATION SCOPE BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE AUTOSCALED\nci-prtest-5a37c28-6439-ig-i us-central1-a zone ci-prtest-5a37c28-6439-ig-i 0 0 ci-prtest-5a37c28-6439-instance-template-node-infra no\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-n-38r7-docker us-central1-a 75 pd-ssd READY\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-m-4644-docker us-central1-a 75 pd-ssd READY\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-n-t26c-docker us-central1-a 75 pd-ssd READY\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-n-9ctw-docker us-central1-a 75 pd-ssd READY\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-n-38r7-openshift us-central1-a 50 pd-ssd READY\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-m-4644-openshift us-central1-a 50 pd-ssd READY\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-n-9ctw-openshift us-central1-a 50 pd-ssd READY\nNAME ZONE SIZE_GB TYPE STATUS\nci-prtest-5a37c28-6439-ig-n-t26c-openshift us-central1-a 50 pd-ssd READY", "stdout_lines": ["NAME MODE IPV4_RANGE GATEWAY_IPV4", "ci-prtest-5a37c28-6439-ocp-network auto", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-master-external ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 
tcp:80,tcp:443,tcp:1936,tcp:8080,tcp:8443,tcp:30000-32000,udp:30000-32000 ocp-master", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-icmp ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 icmp", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-ssh-internal ci-prtest-5a37c28-6439-ocp-network tcp:22 bastion", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-infra-node-internal ci-prtest-5a37c28-6439-ocp-network tcp:5000 ocp ocp-infra-node", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-master-internal ci-prtest-5a37c28-6439-ocp-network tcp:2224,tcp:2379,tcp:2380,tcp:4001,udp:4789,udp:5404,udp:5405,tcp:8053,udp:8053,tcp:8444,tcp:10250,tcp:10255,udp:10255,tcp:24224,udp:24224 ocp ocp-master", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-node-internal ci-prtest-5a37c28-6439-ocp-network udp:4789,tcp:10250,tcp:10255,udp:10255 ocp ocp-node,ocp-infra-node", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-infra-node-external ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 tcp:80,tcp:443,tcp:1936,tcp:30000-32000,udp:30000-32000 ocp-infra-node", "NAME NETWORK SRC_RANGES RULES SRC_TAGS TARGET_TAGS", "ci-prtest-5a37c28-6439-ssh-external ci-prtest-5a37c28-6439-ocp-network 0.0.0.0/0 tcp:22", "NAME MACHINE_TYPE PREEMPTIBLE CREATION_TIMESTAMP", "ci-prtest-5a37c28-6439-instance-template-master n1-standard-2 2017-08-23T09:58:48.326-07:00", "NAME MACHINE_TYPE PREEMPTIBLE CREATION_TIMESTAMP", "ci-prtest-5a37c28-6439-instance-template-node n1-standard-2 2017-08-23T09:58:48.262-07:00", "NAME MACHINE_TYPE PREEMPTIBLE CREATION_TIMESTAMP", "ci-prtest-5a37c28-6439-instance-template-node-infra n1-standard-2 2017-08-23T09:58:48.210-07:00", "NAME LOCATION SCOPE BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE AUTOSCALED", "ci-prtest-5a37c28-6439-ig-n us-central1-a zone ci-prtest-5a37c28-6439-ig-n 0 3 
ci-prtest-5a37c28-6439-instance-template-node no", "NAME LOCATION SCOPE BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE AUTOSCALED", "ci-prtest-5a37c28-6439-ig-m us-central1-a zone ci-prtest-5a37c28-6439-ig-m 0 1 ci-prtest-5a37c28-6439-instance-template-master no", "NAME LOCATION SCOPE BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE AUTOSCALED", "ci-prtest-5a37c28-6439-ig-i us-central1-a zone ci-prtest-5a37c28-6439-ig-i 0 0 ci-prtest-5a37c28-6439-instance-template-node-infra no", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-n-38r7-docker us-central1-a 75 pd-ssd READY", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-m-4644-docker us-central1-a 75 pd-ssd READY", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-n-t26c-docker us-central1-a 75 pd-ssd READY", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-n-9ctw-docker us-central1-a 75 pd-ssd READY", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-n-38r7-openshift us-central1-a 50 pd-ssd READY", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-m-4644-openshift us-central1-a 50 pd-ssd READY", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-n-9ctw-openshift us-central1-a 50 pd-ssd READY", "NAME ZONE SIZE_GB TYPE STATUS", "ci-prtest-5a37c28-6439-ig-n-t26c-openshift us-central1-a 50 pd-ssd READY"]}
PLAY RECAP *********************************************************************
localhost : ok=4 changed=3 unreachable=0 failed=1
Wednesday 23 August 2017 16:59:46 +0000 (0:02:17.379) 0:02:20.148 ******
===============================================================================
provision : Provision GCE resources ----------------------------------- 137.38s
provision : Provision GCE DNS domain ------------------------------------ 1.52s
provision : Templatize DNS script --------------------------------------- 0.51s
provision : Templatize provision script --------------------------------- 0.39s
provision : Ensure that DNS resolves to the hosted zone ----------------- 0.08s
Failure summary:
1. Host: localhost
Play: Ensure all cloud resources necessary for the cluster, including instances, have been started
Task: provision : Provision GCE resources
Message: ???
|
Fixed by openshift/origin-gce#42 (because we don't
attach anymore) which is going to merge O(soon)
…On Wed, Aug 23, 2017 at 4:05 PM, Mo Khan ***@***.***> wrote:
Seen in #15907 <#15907>
https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/6456/
|
The change was delivered to origin-gce and a new image was built, but I'm
not seeing new jobs use the image yet for some reason.
On Aug 24, 2017, at 8:39 AM, Scott Dodson <[email protected]> wrote:
Assigned #14575 <#14575> to
@smarterclayton <https://github.com/smarterclayton>.
|
/close
Fixed
|
Still seeing this:
|
We had some issues. Resolved today. |
https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/2997/consoleFull