GCE cluster provision failure #14575

Closed
bparees opened this issue Jun 12, 2017 · 38 comments
Labels: kind/test-flake, priority/P0

Comments

@bparees
Contributor

bparees commented Jun 12, 2017

Failure summary:

  1. Host:     ci-prtest-5a37c28-2997-ig-n-fndk
     Play:     schedulable_nodes
     Task:     openshift-volume-quota : Create filesystem for /var/lib/origin/openshift.local.volumes
     Message:  Device /dev/sdc not found.

https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/2997/consoleFull

@bparees bparees added kind/test-flake Categorizes issue or PR as related to test flakes. priority/P1 labels Jun 12, 2017
@stevekuznetsov
Contributor

@smarterclayton do we have a SOP for these?

@stevekuznetsov
Contributor

As in, are we creating tickets somewhere for the GCE API?

@smarterclayton
Contributor

We would create tickets under our account, yes. And you need to have enough info that they can reproduce. If you can't reproduce, they basically don't help much.

@stevekuznetsov
Contributor

Can't really prioritize this one

@pweil-
Contributor

pweil- commented Jun 19, 2017

Can't really prioritize this one

Adding P2 then, since issues with no priority show up on the bug list so that they get triaged. Feel free to lower it further.

@smarterclayton
Contributor

There's clearly something wrong with node setup in the installer; this is P0 until we know why. It looks similar to other racy startup failures, and it happens after networking is configured.
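If the `Device /dev/sdc not found` failure reported at the top of this issue is indeed a race between disk attachment and filesystem creation (an assumption, not something confirmed in this thread), one possible guard is to poll for the device before formatting it. The sketch below is illustrative only: `wait_for_device`, its defaults, and the injectable `exists`/`sleep` parameters are hypothetical and not part of the installer.

```python
import os
import time


def wait_for_device(path, timeout=60.0, interval=1.0,
                    exists=os.path.exists, sleep=time.sleep):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    Returns True if the device appeared, False on timeout.
    `exists` and `sleep` are injectable so the loop can be tested
    without a real block device.
    """
    deadline = time.monotonic() + timeout
    while True:
        if exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical usage: only create the filesystem once the disk is visible.
    if not wait_for_device("/dev/sdc", timeout=5.0):
        print("device did not appear; skipping filesystem creation")
```

A check like this only hides the symptom if the disk was never attached at all, so it would not replace finding out why the attach was late or missing.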

@derekwaynecarr
Member

@smarterclayton we have a bz that may or may not be similar but the cluster recovers:
https://bugzilla.redhat.com/show_bug.cgi?id=1466732

@smarterclayton
Contributor

smarterclayton commented Jul 5, 2017 via email

@0xmichalis
Contributor

Another gce provisioning flake in the installer: https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1262/consoleFull#68717269158b6e51eb7608a5981914356

TASK [dynamic-inventory : Templatize environment script] ***********************
Tuesday 11 July 2017  17:12:20 +0000 (0:00:01.010)       0:03:39.415 ********** 
ok: [localhost]

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
Tuesday 11 July 2017  17:12:22 +0000 (0:00:01.467)       0:03:40.882 ********** 
ok: [localhost]
ERROR! Attempted to execute "/usr/share/ansible/openshift-ansible-gce/inventory.sh" as inventory script: Inventory script (/usr/share/ansible/openshift-ansible-gce/inventory.sh) had an execution error: Traceback (most recent call last):
  File "/usr/share/ansible/openshift-ansible-gce/inventory/gce/hosts/gce.py", line 400, in <module>
    GceInventory()
  File "/usr/share/ansible/openshift-ansible-gce/inventory/gce/hosts/gce.py", line 132, in __init__
    print(self.json_format_dict(self.group_instances(zones),
  File "/usr/share/ansible/openshift-ansible-gce/inventory/gce/hosts/gce.py", line 320, in group_instances
    nodes = self.driver.list_nodes()
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 2419, in list_nodes
    use_disk_cache=ex_use_disk_cache)
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 8366, in _to_node
    bd['name'], bd['zone'], use_cache=use_disk_cache)
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 7109, in ex_get_volume
    return self._ex_lookup_volume(name, zone)
  File "/usr/lib/python2.7/site-packages/libcloud/compute/drivers/gce.py", line 7405, in _ex_lookup_volume
    self._ex_populate_dict()
AttributeError: 'GCENodeDriver' object has no attribute '_ex_populate_dict'

@smarterclayton
Contributor

smarterclayton commented Jul 11, 2017 via email

@pweil- pweil- assigned sdodson and unassigned stevekuznetsov Jul 12, 2017
@bparees
Contributor Author

bparees commented Jul 12, 2017

another:

TASK [provision : Provision GCE resources] *************************************
Wednesday 12 July 2017  15:46:04 +0000 (0:00:00.111)       0:00:02.965 ******** 
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/tmp/provision.sh"], "delta": "0:00:41.080842", "end": "2017-07-12 15:46:45.500767", "failed": true, "rc": 1, "start": "2017-07-12 15:46:04.419925", "stderr": "ERROR: (gcloud.compute.networks.create) Could not fetch resource:\n - Quota 'ROUTES' exceeded.  Limit: 300.0\n - Quota 'SUBNETWORKS' exceeded.  Limit: 275.0", "stderr_lines": ["ERROR: (gcloud.compute.networks.create) Could not fetch resource:", " - Quota 'ROUTES' exceeded.  Limit: 300.0", " - Quota 'SUBNETWORKS' exceeded.  Limit: 275.0"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
localhost                  : ok=4    changed=3    unreachable=0    failed=1   

Wednesday 12 July 2017  15:46:45 +0000 (0:00:41.202)       0:00:44.167 ******** 
=============================================================================== 
provision : Provision GCE resources ------------------------------------ 41.20s
provision : Provision GCE DNS domain ------------------------------------ 1.64s
provision : Templatize DNS script --------------------------------------- 0.69s
provision : Templatize provision script --------------------------------- 0.41s
provision : Ensure that DNS resolves to the hosted zone ----------------- 0.11s

Failure summary:

  1. Host:     localhost
     Play:     Ensure all cloud resources necessary for the cluster, including instances, have been started
     Task:     provision : Provision GCE resources
     Message:  ???

https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/4241/console

@bparees
Contributor Author

bparees commented Jul 12, 2017

(that one appears to be a quota exceeded issue)
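Since the failure summary above only shows `Message:  ???`, the quota names have to be read out of the raw `stderr` field. A minimal sketch of pulling them out with a regular expression follows; the pattern is inferred solely from the two error lines in this log, so other gcloud messages may be worded differently, and `parse_quota_errors` is a hypothetical helper name.

```python
import re

# Pattern inferred from the stderr in this issue, e.g.
#   " - Quota 'ROUTES' exceeded.  Limit: 300.0"
QUOTA_RE = re.compile(
    r"Quota '(?P<name>[A-Z_]+)' exceeded\.\s+Limit: (?P<limit>[0-9.]+)"
)


def parse_quota_errors(stderr):
    """Return a {quota_name: limit} dict for every quota error in stderr."""
    return {
        match.group("name"): float(match.group("limit"))
        for match in QUOTA_RE.finditer(stderr)
    }


if __name__ == "__main__":
    stderr = (
        "ERROR: (gcloud.compute.networks.create) Could not fetch resource:\n"
        " - Quota 'ROUTES' exceeded.  Limit: 300.0\n"
        " - Quota 'SUBNETWORKS' exceeded.  Limit: 275.0"
    )
    print(parse_quota_errors(stderr))
```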

@smarterclayton
Contributor

smarterclayton commented Jul 12, 2017 via email

@stevekuznetsov
Contributor

Working on a process to ensure cleanup in GCE so we don't hit quota.
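The actual cleanup process is not described in this thread. As a rough sketch of the selection step only, assuming leaked resources can be recognized by the `ci-prtest-` name prefix seen in these logs and by their age, something like the following could pick deletion candidates. The `creationTimestamp` field name, the six-hour threshold, and `find_stale` itself are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stale(resources, now, max_age=timedelta(hours=6), prefix="ci-prtest-"):
    """Return names of CI resources older than `max_age`.

    `resources` is a list of dicts with "name" and "creationTimestamp"
    (an ISO 8601 string with a UTC offset); `now` must be timezone-aware.
    Resources whose name does not start with `prefix` are never selected.
    """
    stale = []
    for res in resources:
        created = datetime.fromisoformat(res["creationTimestamp"])
        if res["name"].startswith(prefix) and now - created > max_age:
            stale.append(res["name"])
    return stale


if __name__ == "__main__":
    resources = [
        {"name": "ci-prtest-5a37c28-6439-ocp-network",
         "creationTimestamp": "2017-08-23T09:58:48.326-07:00"},
    ]
    now = datetime(2017, 8, 24, 17, 0, 0, tzinfo=timezone.utc)
    print(find_stale(resources, now))
```

Filtering on the prefix first keeps the sketch from ever proposing non-CI resources for deletion, whatever their age.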

@smarterclayton
Contributor

smarterclayton commented Jul 14, 2017 via email

@bparees
Contributor Author

bparees commented Jul 14, 2017

If they're gathered in the S3 artifacts, I don't see them. The only journal item is docker.service.

@sdodson
Member

sdodson commented Jul 14, 2017

Yeah, we need those. Really we need the full binary journal, no sense in gathering only specific journals when the ones we want are probably 90% of the total log volume anyway. I'll look at where that work got left off.

@stevekuznetsov
Contributor

stevekuznetsov commented Jul 14, 2017

Looks like we ensured the artifact gathering step could not fail, but a single failure within the step could still skip execution of the rest of it, causing few or no artifacts to be gathered. I've fixed that in openshift-eng/aos-cd-jobs#433
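The general idea of isolating each gathering command, so that one failure cannot skip the rest, can be sketched as follows. This is not the change made in openshift-eng/aos-cd-jobs#433, whose contents are not shown here; `gather_artifacts` and the step names are hypothetical.

```python
def gather_artifacts(steps):
    """Run every (name, callable) step and return {name: error} for failures.

    Each step is wrapped in its own try/except, so an exception in one
    step is recorded and the remaining steps still run.
    """
    failures = {}
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # record the failure and keep gathering
            failures[name] = str(exc)
    return failures


if __name__ == "__main__":
    gathered = []

    def broken_step():
        raise RuntimeError("journal export failed")

    steps = [
        ("docker-journal", lambda: gathered.append("docker.service")),
        ("node-journal", broken_step),
        ("master-config", lambda: gathered.append("master-config.yaml")),
    ]
    print(gather_artifacts(steps))
    print(gathered)
```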

@enj
Contributor

enj commented Jul 22, 2017

For #15348 we may need apache/libcloud@03df6a8

@smarterclayton
Contributor

smarterclayton commented Jul 22, 2017 via email

@sdodson
Member

sdodson commented Jul 26, 2017

So what's in origin-gce is just gce.py, the Ansible inventory script, which relies on libcloud. Which version of python2-libcloud is installed on the host?

@sdodson
Member

sdodson commented Jul 26, 2017

Never mind, the latest version of the package doesn't have that fix.

@stevekuznetsov
Contributor

@smarterclayton @enj do we have a resolution for this?

@enj
Contributor

enj commented Aug 22, 2017

@stevekuznetsov that fix is tagged with v2.1.0-tentative but the highest RPM release I have seen is v2.0.
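To make that comparison concrete, here is a small sketch that checks whether an installed version string is at least the version the fix is tagged with. It assumes the fix would first ship in 2.1.0, which this thread only suggests through the `v2.1.0-tentative` tag; `parse_version` and `has_fix` are hypothetical helpers, not part of libcloud.

```python
import re


def parse_version(version):
    """Turn a string such as 'v2.0' or '2.1.0-tentative' into (major, minor, patch).

    Missing components default to 0; any suffix after the numbers is ignored.
    """
    match = re.match(r"v?(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognized version: %r" % (version,))
    return tuple(int(group) if group else 0 for group in match.groups())


def has_fix(installed, minimum="2.1.0"):
    """Return True if `installed` is at least `minimum` (assumed fix version)."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("v2.0", "v2.1.0-tentative"):
        print(candidate, has_fix(candidate))
```

On a host, the installed version string could be taken from the RPM metadata or from `libcloud.__version__`.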

@simo5
Contributor

simo5 commented Aug 23, 2017

Just hit this; not identical, but it seems closely related:

"ERROR: (gcloud.compute.instances.attach-disk) Could not fetch resource:", " - The resource 'projects/openshift-gce-devel-ci/zones/us-central1-a/instances/ci-prtest-5a37c28-6439-ig-n-38r7' was not found"], "stdout": "NAME                                MODE  IPV4_RANGE  GATEWAY_IPV4\nci-prtest-5a37c28-6439-ocp-network  auto\nNAME                                    NETWORK                             SRC_RANGES  RULES                                                                      SRC_TAGS  TARGET_TAGS\nci-prtest-5a37c28-6439-master-external  ci-prtest-5a37c28-6439-ocp-network  0.0.0.0/0   tcp:80,tcp:443,tcp:1936,tcp:8080,tcp:8443,tcp:30000-32000,udp:30000-32000            ocp-master\nNAME                         NETWORK                             SRC_RANGES  RULES  SRC_TAGS  TARGET_TAGS\nci-prtest-5a37c28-6439-icmp  ci-prtest-5a37c28-6439-ocp-network  0.0.0.0/0   icmp\nNAME                                 NETWORK                             SRC_RANGES  RULES   SRC_TAGS  TARGET_TAGS\nci-prtest-5a37c28-6439-ssh-internal  ci-prtest-5a37c28-6439-ocp-network              tcp:22  bastion\nNAME                                        NETWORK                             SRC_RANGES  RULES     SRC_TAGS  TARGET_TAGS\nci-prtest-5a37c28-6439-infra-node-internal  ci-prtest-5a37c28-6439-ocp-network              tcp:5000  ocp       ocp-infra-node\nNAME                                    NETWORK                             SRC_RANGES  RULES                                                                                                                                        SRC_TAGS  TARGET_TAGS\nci-prtest-5a37c28-6439-master-internal  ci-prtest-5a37c28-6439-ocp-network              tcp:2224,tcp:2379,tcp:2380,tcp:4001,udp:4789,udp:5404,udp:5405,tcp:8053,udp:8053,tcp:8444,tcp:10250,tcp:10255,udp:10255,tcp:24224,udp:24224  ocp       ocp-master\nNAME                                  NETWORK                             SRC_RANGES  RULES                                   SRC_TAGS  
TARGET_TAGS\nci-prtest-5a37c28-6439-node-internal  ci-prtest-5a37c28-6439-ocp-network              udp:4789,tcp:10250,tcp:10255,udp:10255  ocp       ocp-node,ocp-infra-node\nNAME                                        NETWORK                             SRC_RANGES  RULES                                                    SRC_TAGS  TARGET_TAGS\nci-prtest-5a37c28-6439-infra-node-external  ci-prtest-5a37c28-6439-ocp-network  0.0.0.0/0   tcp:80,tcp:443,tcp:1936,tcp:30000-32000,udp:30000-32000            ocp-infra-node\nNAME                                 NETWORK                             SRC_RANGES  RULES   SRC_TAGS  TARGET_TAGS\nci-prtest-5a37c28-6439-ssh-external  ci-prtest-5a37c28-6439-ocp-network  0.0.0.0/0   tcp:22\nNAME                                             MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP\nci-prtest-5a37c28-6439-instance-template-master  n1-standard-2               2017-08-23T09:58:48.326-07:00\nNAME                                           MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP\nci-prtest-5a37c28-6439-instance-template-node  n1-standard-2               2017-08-23T09:58:48.262-07:00\nNAME                                                 MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP\nci-prtest-5a37c28-6439-instance-template-node-infra  n1-standard-2               2017-08-23T09:58:48.210-07:00\nNAME                         LOCATION       SCOPE  BASE_INSTANCE_NAME           SIZE  TARGET_SIZE  INSTANCE_TEMPLATE                              AUTOSCALED\nci-prtest-5a37c28-6439-ig-n  us-central1-a  zone   ci-prtest-5a37c28-6439-ig-n  0     3            ci-prtest-5a37c28-6439-instance-template-node  no\nNAME                         LOCATION       SCOPE  BASE_INSTANCE_NAME           SIZE  TARGET_SIZE  INSTANCE_TEMPLATE                                AUTOSCALED\nci-prtest-5a37c28-6439-ig-m  us-central1-a  zone   ci-prtest-5a37c28-6439-ig-m  0     1            ci-prtest-5a37c28-6439-instance-template-master  no\nNAME                         
LOCATION       SCOPE  BASE_INSTANCE_NAME           SIZE  TARGET_SIZE  INSTANCE_TEMPLATE                                    AUTOSCALED\nci-prtest-5a37c28-6439-ig-i  us-central1-a  zone   ci-prtest-5a37c28-6439-ig-i  0     0            ci-prtest-5a37c28-6439-instance-template-node-infra  no\nNAME                                     ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-n-38r7-docker  us-central1-a  75       pd-ssd  READY\nNAME                                     ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-m-4644-docker  us-central1-a  75       pd-ssd  READY\nNAME                                     ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-n-t26c-docker  us-central1-a  75       pd-ssd  READY\nNAME                                     ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-n-9ctw-docker  us-central1-a  75       pd-ssd  READY\nNAME                                        ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-n-38r7-openshift  us-central1-a  50       pd-ssd  READY\nNAME                                        ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-m-4644-openshift  us-central1-a  50       pd-ssd  READY\nNAME                                        ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-n-9ctw-openshift  us-central1-a  50       pd-ssd  READY\nNAME                                        ZONE           SIZE_GB  TYPE    STATUS\nci-prtest-5a37c28-6439-ig-n-t26c-openshift  us-central1-a  50       pd-ssd  READY", "stdout_lines": ["NAME                                MODE  IPV4_RANGE  GATEWAY_IPV4", "ci-prtest-5a37c28-6439-ocp-network  auto", "NAME                                    NETWORK                             SRC_RANGES  RULES                                                                      SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-master-external  ci-prtest-5a37c28-6439-ocp-network  
0.0.0.0/0   tcp:80,tcp:443,tcp:1936,tcp:8080,tcp:8443,tcp:30000-32000,udp:30000-32000            ocp-master", "NAME                         NETWORK                             SRC_RANGES  RULES  SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-icmp  ci-prtest-5a37c28-6439-ocp-network  0.0.0.0/0   icmp", "NAME                                 NETWORK                             SRC_RANGES  RULES   SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-ssh-internal  ci-prtest-5a37c28-6439-ocp-network              tcp:22  bastion", "NAME                                        NETWORK                             SRC_RANGES  RULES     SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-infra-node-internal  ci-prtest-5a37c28-6439-ocp-network              tcp:5000  ocp       ocp-infra-node", "NAME                                    NETWORK                             SRC_RANGES  RULES                                                                                                                                        SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-master-internal  ci-prtest-5a37c28-6439-ocp-network              tcp:2224,tcp:2379,tcp:2380,tcp:4001,udp:4789,udp:5404,udp:5405,tcp:8053,udp:8053,tcp:8444,tcp:10250,tcp:10255,udp:10255,tcp:24224,udp:24224  ocp       ocp-master", "NAME                                  NETWORK                             SRC_RANGES  RULES                                   SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-node-internal  ci-prtest-5a37c28-6439-ocp-network              udp:4789,tcp:10250,tcp:10255,udp:10255  ocp       ocp-node,ocp-infra-node", "NAME                                        NETWORK                             SRC_RANGES  RULES                                                    SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-infra-node-external  ci-prtest-5a37c28-6439-ocp-network  0.0.0.0/0   tcp:80,tcp:443,tcp:1936,tcp:30000-32000,udp:30000-32000            ocp-infra-node", "NAME                                 
NETWORK                             SRC_RANGES  RULES   SRC_TAGS  TARGET_TAGS", "ci-prtest-5a37c28-6439-ssh-external  ci-prtest-5a37c28-6439-ocp-network  0.0.0.0/0   tcp:22", "NAME                                             MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP", "ci-prtest-5a37c28-6439-instance-template-master  n1-standard-2               2017-08-23T09:58:48.326-07:00", "NAME                                           MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP", "ci-prtest-5a37c28-6439-instance-template-node  n1-standard-2               2017-08-23T09:58:48.262-07:00", "NAME                                                 MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP", "ci-prtest-5a37c28-6439-instance-template-node-infra  n1-standard-2               2017-08-23T09:58:48.210-07:00", "NAME                         LOCATION       SCOPE  BASE_INSTANCE_NAME           SIZE  TARGET_SIZE  INSTANCE_TEMPLATE                              AUTOSCALED", "ci-prtest-5a37c28-6439-ig-n  us-central1-a  zone   ci-prtest-5a37c28-6439-ig-n  0     3            ci-prtest-5a37c28-6439-instance-template-node  no", "NAME                         LOCATION       SCOPE  BASE_INSTANCE_NAME           SIZE  TARGET_SIZE  INSTANCE_TEMPLATE                                AUTOSCALED", "ci-prtest-5a37c28-6439-ig-m  us-central1-a  zone   ci-prtest-5a37c28-6439-ig-m  0     1            ci-prtest-5a37c28-6439-instance-template-master  no", "NAME                         LOCATION       SCOPE  BASE_INSTANCE_NAME           SIZE  TARGET_SIZE  INSTANCE_TEMPLATE                                    AUTOSCALED", "ci-prtest-5a37c28-6439-ig-i  us-central1-a  zone   ci-prtest-5a37c28-6439-ig-i  0     0            ci-prtest-5a37c28-6439-instance-template-node-infra  no", "NAME                                     ZONE           SIZE_GB  TYPE    STATUS", "ci-prtest-5a37c28-6439-ig-n-38r7-docker  us-central1-a  75       pd-ssd  READY", "NAME                                     ZONE           SIZE_GB  TYPE    
STATUS", "ci-prtest-5a37c28-6439-ig-m-4644-docker  us-central1-a  75       pd-ssd  READY", "NAME                                     ZONE           SIZE_GB  TYPE    STATUS", "ci-prtest-5a37c28-6439-ig-n-t26c-docker  us-central1-a  75       pd-ssd  READY", "NAME                                     ZONE           SIZE_GB  TYPE    STATUS", "ci-prtest-5a37c28-6439-ig-n-9ctw-docker  us-central1-a  75       pd-ssd  READY", "NAME                                        ZONE           SIZE_GB  TYPE    STATUS", "ci-prtest-5a37c28-6439-ig-n-38r7-openshift  us-central1-a  50       pd-ssd  READY", "NAME                                        ZONE           SIZE_GB  TYPE    STATUS", "ci-prtest-5a37c28-6439-ig-m-4644-openshift  us-central1-a  50       pd-ssd  READY", "NAME                                        ZONE           SIZE_GB  TYPE    STATUS", "ci-prtest-5a37c28-6439-ig-n-9ctw-openshift  us-central1-a  50       pd-ssd  READY", "NAME                                        ZONE           SIZE_GB  TYPE    STATUS", "ci-prtest-5a37c28-6439-ig-n-t26c-openshift  us-central1-a  50       pd-ssd  READY"]}

PLAY RECAP *********************************************************************
localhost                  : ok=4    changed=3    unreachable=0    failed=1   

Wednesday 23 August 2017  16:59:46 +0000 (0:02:17.379)       0:02:20.148 ****** 
=============================================================================== 
provision : Provision GCE resources ----------------------------------- 137.38s
provision : Provision GCE DNS domain ------------------------------------ 1.52s
provision : Templatize DNS script --------------------------------------- 0.51s
provision : Templatize provision script --------------------------------- 0.39s
provision : Ensure that DNS resolves to the hosted zone ----------------- 0.08s

Failure summary:

  1. Host:     localhost
     Play:     Ensure all cloud resources necessary for the cluster, including instances, have been started
     Task:     provision : Provision GCE resources
     Message:  ???

@smarterclayton
Contributor

smarterclayton commented Aug 23, 2017 via email

@enj
Contributor

enj commented Aug 23, 2017

@smarterclayton
Contributor

smarterclayton commented Aug 23, 2017 via email

@sdodson sdodson assigned smarterclayton and unassigned sdodson Aug 24, 2017
@smarterclayton
Contributor

smarterclayton commented Aug 24, 2017 via email

@smarterclayton
Contributor

smarterclayton commented Aug 24, 2017 via email

@simo5
Contributor

simo5 commented Aug 31, 2017

Still seeing this:

TASK [gce-provision : Provision GCE resources] *********************************
Wednesday 30 August 2017  19:11:04 +0000 (0:00:00.076)       0:00:02.751 ****** 
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/tmp/provision.sh"], "delta": "0:00:51.272819", "end": "2017-08-30 19:11:55.974629", "failed": true, "rc": 1, "start": "2017-08-30 19:11:04.701810", "stderr": "ERROR: (gcloud.compute.networks.create) Could not fetch resource:\n - Quota 'ROUTES' exceeded.  Limit: 300.0", "stderr_lines": ["ERROR: (gcloud.compute.networks.create) Could not fetch resource:", " - Quota 'ROUTES' exceeded.  Limit: 300.0"], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
localhost                  : ok=4    changed=3    unreachable=0    failed=1   

Wednesday 30 August 2017  19:11:55 +0000 (0:00:51.389)       0:00:54.140 ****** 
=============================================================================== 
gce-provision : Provision GCE resources -------------------------------- 51.39s
gce-provision : Provision GCE DNS domain -------------------------------- 1.59s
gce-provision : Templatize DNS script ----------------------------------- 0.51s
gce-provision : Templatize provision script ----------------------------- 0.38s
gce-provision : Ensure that DNS resolves to the hosted zone ------------- 0.08s

Failure summary:

  1. Host:     localhost
     Play:     Ensure all cloud resources necessary for the cluster, including instances, have been started
     Task:     gce-provision : Provision GCE resources
     Message:  ???
++ export status=FAILURE
++ status=FAILURE
+ set +o xtrace

@stevekuznetsov
Contributor

  • Quota 'ROUTES' exceeded

We had some issues. Resolved today.
