
pkg: Pin to RHCOS 47.198 and quay.io/openshift-release-dev/ocp-release:4.0.0-4 #848

Closed
wants to merge 2 commits

Conversation

@wking (Member) commented Dec 8, 2018

DO NOT MERGE!

That's the latest RHCOS release:

$ curl -s https://releases-rhcos.svc.ci.openshift.org/storage/releases/maipo/builds.json | jq '{latest: .builds[0], timestamp}'
{
  "latest": "47.198",
  "timestamp": "2018-12-08T23:13:22Z"
}

And @smarterclayton just pushed 4.0.0-alpha.0-2018-12-07-090414 to quay.io/openshift-release-dev/ocp-release:4.0.0-4. That's not the most recent release, but it's the most-recent stable release ;).

Renaming OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE gets us CI testing of the pinned release despite openshift/release@60007df2 (openshift/release#1793).
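
For anyone reproducing the pin locally, a minimal sketch (assuming the override variable keeps its current name and the usual create-cluster invocation):

$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift-release-dev/ocp-release:4.0.0-4 \
    openshift-install create cluster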

Through f7d6d29 (Merge pull request openshift#806 from sallyom/log-url-clarify-pw, 2018-12-07).
@openshift-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 8, 2018
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 8, 2018
@wking wking force-pushed the version-0.6.0-pins branch from 5a49c4e to d108128 on December 8, 2018 at 23:39
@wking (Member, Author) commented Dec 8, 2018

This cherry-picks #773 onto #841 and bumps the pinned versions. No need to merge this, just chime in with yea/nay or whatever ;). We should get past the recent CI blockages via the pinned, older update payload.

@crawford (Contributor) left a comment

LGTM

@smarterclayton (Contributor) commented

I can live with the two failures there; there are other passing router tests.

/retest

@wking (Member, Author) commented Dec 9, 2018

The errors from the previous e2e-aws were:

fail [github.com/openshift/origin/test/extended/router/stress.go:176]: Expected error:
    <*errors.errorString | 0xc4216a8230>: {
        s: "replicaset \"router\" never became ready",
    }
    replicaset "router" never became ready
not to have occurred
...
failed: (3m35s) 2018-12-09T00:13:07 "[Conformance][Area:Networking][Feature:Router] The HAProxy router converges when multiple routers are writing conflicting status [Suite:openshift/conformance/parallel/minimal] [Suite:openshift/smoke-4]"

and:

fail [github.com/openshift/origin/test/extended/router/stress.go:90]: Expected error:
    <*errors.errorString | 0xc421fc8b00>: {
        s: "replicaset \"router\" never became ready",
    }
    replicaset "router" never became ready
not to have occurred
...
failed: (3m21s) 2018-12-09T00:18:39 "[Conformance][Area:Networking][Feature:Router] The HAProxy router converges when multiple routers are writing status [Suite:openshift/conformance/parallel/minimal] [Suite:openshift/smoke-4]"

@wking (Member, Author) commented Dec 9, 2018

And it's still working its way through teardown, but job 2082 failed the same two tests with the same two reported errors. So I suspect it's a real issue (although perhaps a peripheral one) and not a flake. I have journalctl dumps from the three masters for this run; I'll go through them and see if anything suspicious is mentioned around the time of the two failures.
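
For reference, the dumps were collected with something along these lines, one file per master (the host here is a placeholder):

$ ssh core@<master-ip> journalctl -f | tee master-0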

@wking (Member, Author) commented Dec 9, 2018

So the two replicaset "router" never became ready errors from job 2082 were at 05:58:24 and 05:59:01. My logs from master-0 have:

Dec 09 05:58:13 ip-10-0-13-92 sshd[32430]: PAM 5 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=101.72.24.174
Dec 09 05:58:13 ip-10-0-13-92 sshd[32430]: PAM service(sshd) ignoring max retries; 6 > 3
Dec 09 05:58:23 ip-10-0-13-92 hyperkube[4119]: E1209 05:58:23.439962    4119 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"echoserver-sourceip.156e9498265d35a8", GenerateName:"", Namespace:"e2e-tests-services-tr59j", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-services-tr59j", Name:"echoserver-sourceip", UID:"519e82f7-fb77-11e8-b7fb-125c3d3368ba", APIVersion:"v1", ResourceVersion:"46737", FieldPath:"spec.containers{echoserver}"}, Reason:"Killing", Message:"Killing container with id cri-o://echoserver:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"ip-10-0-13-92.ec2.internal"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbefb499fd9e1ffa8, ext:1615801035461, loc:(*time.Location)(0x9061c80)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbefb499fd9e1ffa8, ext:1615801035461, loc:(*time.Location)(0x9061c80)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "echoserver-sourceip.156e9498265d35a8" is forbidden: unable to create new content in namespace e2e-tests-services-tr59j because it is being terminated' (will not retry!)
Dec 09 05:58:23 ip-10-0-13-92 hyperkube[4119]: E1209 05:58:23.498490    4119 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /var/lib/containers/storage/overlay/b97f5f5fcabe7c8f8367cf3f42c5d12b47e4cd47bf320f07aca3bd3c1add28cb/diff with output stdout: , stderr: du: cannot access ‘/var/lib/containers/storage/overlay/b97f5f5fcabe7c8f8367cf3f42c5d12b47e4cd47bf320f07aca3bd3c1add28cb/diff’: No such file or directory
Dec 09 05:58:23 ip-10-0-13-92 hyperkube[4119]: - exit status 1, rootInodeErr: cmd [ionice -c3 nice -n 19 find /var/lib/containers/storage/overlay/b97f5f5fcabe7c8f8367cf3f42c5d12b47e4cd47bf320f07aca3bd3c1add28cb/diff -xdev -printf .] failed. stderr: find: ‘/var/lib/containers/storage/overlay/b97f5f5fcabe7c8f8367cf3f42c5d12b47e4cd47bf320f07aca3bd3c1add28cb/diff’: No such file or directory
Dec 09 05:58:23 ip-10-0-13-92 hyperkube[4119]: ; err: exit status 1, extraDiskErr: du command failed on /var/log/pods/4ada9195-fb77-11e8-b7fb-125c3d3368ba/nginx/0.log with output stdout: , stderr: du: cannot access ‘/var/log/pods/4ada9195-fb77-11e8-b7fb-125c3d3368ba/nginx/0.log’: No such file or directory
Dec 09 05:58:23 ip-10-0-13-92 hyperkube[4119]: - exit status 1
Dec 09 05:58:23 ip-10-0-13-92 kernel: device veth98832a16 left promiscuous mode
Dec 09 05:58:24 ip-10-0-13-92 hyperkube[4119]: I1209 05:58:24.610336    4119 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-qbx76" (UniqueName: "kubernetes.io/secret/519e82f7-fb77-11e8-b7fb-125c3d3368ba-default-token-qbx76") pod "519e82f7-fb77-11e8-b7fb-125c3d3368ba" (UID: "519e82f7-fb77-11e8-b7fb-125c3d3368ba")
Dec 09 05:58:24 ip-10-0-13-92 hyperkube[4119]: I1209 05:58:24.620722    4119 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/519e82f7-fb77-11e8-b7fb-125c3d3368ba-default-token-qbx76" (OuterVolumeSpecName: "default-token-qbx76") pod "519e82f7-fb77-11e8-b7fb-125c3d3368ba" (UID: "519e82f7-fb77-11e8-b7fb-125c3d3368ba"). InnerVolumeSpecName "default-token-qbx76". PluginName "kubernetes.io/secret", VolumeGidValue ""
Dec 09 05:58:24 ip-10-0-13-92 hyperkube[4119]: I1209 05:58:24.710863    4119 reconciler.go:301] Volume detached for volume "default-token-qbx76" (UniqueName: "kubernetes.io/secret/519e82f7-fb77-11e8-b7fb-125c3d3368ba-default-token-qbx76") on node "ip-10-0-13-92.ec2.internal" DevicePath ""
Dec 09 05:58:24 ip-10-0-13-92 systemd[1]: Removed slice libcontainer container kubepods-besteffort-pod519e82f7_fb77_11e8_b7fb_125c3d3368ba.slice.
Dec 09 05:58:35 ip-10-0-13-92 hyperkube[4119]: E1209 05:58:35.980350    4119 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /var/lib/containers/storage/overlay/00065b4bb2b49b3bac8562bc8c7bc173003381d5b37ae45aad79e2ac6a9bfa42/diff with output stdout: , stderr: du: cannot access ‘/var/lib/containers/storage/overlay/00065b4bb2b49b3bac8562bc8c7bc173003381d5b37ae45aad79e2ac6a9bfa42/diff’: No such file or directory
Dec 09 05:58:35 ip-10-0-13-92 hyperkube[4119]: - exit status 1, rootInodeErr: cmd [ionice -c3 nice -n 19 find /var/lib/containers/storage/overlay/00065b4bb2b49b3bac8562bc8c7bc173003381d5b37ae45aad79e2ac6a9bfa42/diff -xdev -printf .] failed. stderr: find: ‘/var/lib/containers/storage/overlay/00065b4bb2b49b3bac8562bc8c7bc173003381d5b37ae45aad79e2ac6a9bfa42/diff’: No such file or directory
Dec 09 05:58:35 ip-10-0-13-92 hyperkube[4119]: ; err: exit status 1, extraDiskErr: du command failed on /var/log/pods/519e82f7-fb77-11e8-b7fb-125c3d3368ba/echoserver/0.log with output stdout: , stderr: du: cannot access ‘/var/log/pods/519e82f7-fb77-11e8-b7fb-125c3d3368ba/echoserver/0.log’: No such file or directory
Dec 09 05:58:35 ip-10-0-13-92 hyperkube[4119]: - exit status 1
Dec 09 06:08:19 ip-10-0-13-92 systemd[1]: Found ordering cycle on local-fs.target/stop

master-1 has:

Dec 09 05:57:49 ip-10-0-29-127 hyperkube[4124]: E1209 05:57:49.612584    4124 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"netserver
-0.156e949043eda45e", GenerateName:"", Namespace:"e2e-tests-pod-network-test-r6lgk", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Lo
cation)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializ
ers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pod-network-test-r6lgk", Name:"netserver-0", UID:"40bc99e7-fb77-11e8-8a4
7-0a42d33f8e78", APIVersion:"v1", ResourceVersion:"44559", FieldPath:"spec.containers{webserver}"}, Reason:"Killing", Message:"Killing container with id cri-o://webserver:Need to kill Pod", Source:v1.EventSource
{Component:"kubelet", Host:"ip-10-0-29-127.ec2.internal"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbefb49976201425e, ext:1582427716889, loc:(*time.Location)(0x9061c80)}}, LastTimestamp:v1.Time{Time:time.Tim
e{wall:0xbefb49976223028f, ext:1582429928720, loc:(*time.Location)(0x9061c80)}}, Count:2, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSerie
s)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "netserver-0.156e949043eda45e" is forbidden: unable to create new content in namespace e2e-tests-po
d-network-test-r6lgk because it is being terminated' (will not retry!)
Dec 09 05:57:49 ip-10-0-29-127 hyperkube[4124]: I1209 05:57:49.624872    4124 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-vzfpn" (UniqueName: "kubernetes.io/secret/40bc99
e7-fb77-11e8-8a47-0a42d33f8e78-default-token-vzfpn") pod "40bc99e7-fb77-11e8-8a47-0a42d33f8e78" (UID: "40bc99e7-fb77-11e8-8a47-0a42d33f8e78")
Dec 09 05:57:49 ip-10-0-29-127 hyperkube[4124]: I1209 05:57:49.636750    4124 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/40bc99e7-fb77-11e8-8a47-0a42d33f8e78-de
fault-token-vzfpn" (OuterVolumeSpecName: "default-token-vzfpn") pod "40bc99e7-fb77-11e8-8a47-0a42d33f8e78" (UID: "40bc99e7-fb77-11e8-8a47-0a42d33f8e78"). InnerVolumeSpecName "default-token-vzfpn". PluginName "ku
bernetes.io/secret", VolumeGidValue ""
Dec 09 05:57:49 ip-10-0-29-127 kernel: device veth172007c3 left promiscuous mode
Dec 09 05:57:49 ip-10-0-29-127 hyperkube[4124]: I1209 05:57:49.725433    4124 reconciler.go:301] Volume detached for volume "default-token-vzfpn" (UniqueName: "kubernetes.io/secret/40bc99e7-fb77-11e8-8a47-0a42d3
3f8e78-default-token-vzfpn") on node "ip-10-0-29-127.ec2.internal" DevicePath ""
Dec 09 05:57:50 ip-10-0-29-127 hyperkube[4124]: W1209 05:57:50.060836    4124 pod_container_deletor.go:75] Container "eed06008b22a4e0dbee86719690513b8f1c6a904e66060ed39c0639859d8e4ee" not found in pod's containers
Dec 09 05:57:50 ip-10-0-29-127 hyperkube[4124]: W1209 05:57:50.880066    4124 kubelet_getters.go:264] Path "/var/lib/kubelet/pods/40bc99e7-fb77-11e8-8a47-0a42d33f8e78/volumes" does not exist
Dec 09 05:57:50 ip-10-0-29-127 systemd[1]: Removed slice libcontainer container kubepods-besteffort-pod40bc99e7_fb77_11e8_8a47_0a42d33f8e78.slice.
Dec 09 05:57:50 ip-10-0-29-127 hyperkube[4124]: E1209 05:57:50.904732    4124 kuberuntime_container.go:65] Can't make a ref to pod "netserver-0_e2e-tests-pod-network-test-r6lgk(40bc99e7-fb77-11e8-8a47-0a42d33f8e
78)", container webserver: selfLink was empty, can't make reference
Dec 09 05:58:14 ip-10-0-29-127 hyperkube[4124]: W1209 05:58:14.172269    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/olm-operator-75f785f98b-55dgs due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
Dec 09 05:59:00 ip-10-0-29-127 hyperkube[4124]: W1209 05:59:00.170282    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/catalog-operator-5499796c76-72mlv due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
Dec 09 05:59:24 ip-10-0-29-127 hyperkube[4124]: W1209 05:59:24.169851    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/olm-operator-75f785f98b-55dgs due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
Dec 09 06:00:17 ip-10-0-29-127 hyperkube[4124]: W1209 06:00:17.171568    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/catalog-operator-5499796c76-72mlv due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
Dec 09 06:00:46 ip-10-0-29-127 hyperkube[4124]: W1209 06:00:46.169641    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/olm-operator-75f785f98b-55dgs due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
Dec 09 06:01:41 ip-10-0-29-127 hyperkube[4124]: W1209 06:01:41.170521    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/catalog-operator-5499796c76-72mlv due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.

and, for reasons that are not clear to me, my master-2 logs end with:

Dec 09 05:57:50 ip-10-0-44-91 hyperkube[4107]: W1209 05:57:50.936783    4107 pod_container_deletor.go:75] Container "44acd41e1d9f5d0a766928c73c19138b85068647cfddfd7ebe76074c94af3673" not found in pod's containers

Maybe they went too quiet and my ssh core@... journalctl -f | tee master-2 connection was dropped.

The most concerning entries are the secrets "coreos-pull-secret" not found warnings, although I'm not familiar with the tests; maybe that's expected occasionally. Just in case, here are the earliest and latest brackets on that issue:

$ grep -h 'pull.*secret' master-* | sort | head -n2
Dec 09 05:41:25 ip-10-0-29-127 hyperkube[4124]: W1209 05:41:25.939760    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/catalog-operator-5499796c76-72mlv due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
Dec 09 05:41:36 ip-10-0-29-127 hyperkube[4124]: W1209 05:41:36.213621    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/olm-operator-75f785f98b-55dgs due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
$ grep -h 'pull.*secret' master-* | sort | tail -n2
Dec 09 06:07:30 ip-10-0-29-127 hyperkube[4124]: W1209 06:07:30.170277    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/olm-operator-75f785f98b-55dgs due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.
Dec 09 06:07:48 ip-10-0-29-127 hyperkube[4124]: W1209 06:07:48.171280    4124 kubelet_pods.go:841] Unable to retrieve pull secret openshift-operator-lifecycle-manager/coreos-pull-secret for openshift-operator-lifecycle-manager/catalog-operator-5499796c76-72mlv due to secrets "coreos-pull-secret" not found.  The image pull may not succeed.

And those errors have also been reported over in operator-framework/operator-lifecycle-manager#607.
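
To rule out a genuinely missing secret, a quick check against the live cluster (assuming kubeconfig access) would be something like:

$ KUBECONFIG=kubeconfig oc get secret coreos-pull-secret -n openshift-operator-lifecycle-manager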

@wking (Member, Author) commented Dec 9, 2018

While there are multiple HAProxy tests in the e2e suite, the ones failing are the only two stress tests:

$ git config remote.origin.url
https://github.com/openshift/origin.git
$ git describe --dirty
v4.0.0-alpha.0-759-g9d2874f
$ git grep '\.It(' test/extended/router/stress.go
test/extended/router/stress.go:		g.It("converges when multiple routers are writing status", func() {
test/extended/router/stress.go:		g.It("converges when multiple routers are writing conflicting status", func() {

They're also the only two that use waitForReadyReplicaSet, which is where we're seeing the failure. I'll run again and try to watch that replica set. And once we green up CI, someone should file a PR dumping the replica set's status, because "never became ready" isn't as helpful as "currently has status $STATUS for $REASONS" :p.
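
That claim is easy to double-check with another grep in the same origin checkout, e.g.:

$ git grep -n 'waitForReadyReplicaSet' test/extended/router/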

The event log also looks clean for this replica set (I checked it since events there might mitigate the generic log message). But we only have:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/848/pull-ci-openshift-installer-master-e2e-aws/2082/artifacts/e2e-aws/events.json | jq '.items[] | select(.involvedObject.kind == "ReplicaSet" and (.involvedObject.name | contains("router")))'
{
  "apiVersion": "v1",
  "count": 1,
  "eventTime": null,
  "firstTimestamp": "2018-12-09T05:41:29Z",
  "involvedObject": {
    "apiVersion": "apps/v1",
    "kind": "ReplicaSet",
    "name": "router-default-6c45f76f75",
    "namespace": "openshift-ingress",
    "resourceVersion": "7040",
    "uid": "13d07968-fb75-11e8-b05a-125c3d3368ba"
  },
  "kind": "Event",
  "lastTimestamp": "2018-12-09T05:41:29Z",
  "message": "Created pod: router-default-6c45f76f75-4g9t7",
  "metadata": {
    "creationTimestamp": "2018-12-09T05:41:29Z",
    "name": "router-default-6c45f76f75.156e93ac29775a8b",
    "namespace": "openshift-ingress",
    "resourceVersion": "7067",
    "selfLink": "/api/v1/namespaces/openshift-ingress/events/router-default-6c45f76f75.156e93ac29775a8b",
    "uid": "13ec0643-fb75-11e8-b05a-125c3d3368ba"
  },
  "reason": "SuccessfulCreate",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "replicaset-controller"
  },
  "type": "Normal"
}

which looks fine.

/retest

@wking (Member, Author) commented Dec 9, 2018

Job 2084 crashed and burned:

Error: 249 fail, 14 pass, 63 skip (18m4s)

possibly due to a Kubernetes API server installer pod being OOM-killed:

$ KUBECONFIG=kubeconfig oc get pods --all-namespaces | grep OOM
openshift-kube-apiserver                                  installer-2-ip-10-0-23-167.ec2.internal                           0/1       OOMKilled     0          19m
$ KUBECONFIG=kubeconfig oc get pod -o yaml -n openshift-kube-apiserver installer-2-ip-10-0-23-167.ec2.internal
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-12-09T08:47:19Z
  labels:
    app: installer
  name: installer-2-ip-10-0-23-167.ec2.internal
  namespace: openshift-kube-apiserver
  resourceVersion: "8997"
  selfLink: /api/v1/namespaces/openshift-kube-apiserver/pods/installer-2-ip-10-0-23-167.ec2.internal
  uid: 095a065f-fb8f-11e8-afd9-0a80660a673e
spec:
  containers:
  - args:
    - -v=4
    - --revision=2
    - --namespace=openshift-kube-apiserver
    - --pod=kube-apiserver-pod
    - --resource-dir=/etc/kubernetes/static-pod-resources
    - --pod-manifest-dir=/etc/kubernetes/manifests
    - --configmaps=kube-apiserver-pod
    - --configmaps=config
    - --configmaps=aggregator-client-ca
    - --configmaps=client-ca
    - --configmaps=etcd-serving-ca
    - --configmaps=kubelet-serving-ca
    - --configmaps=sa-token-signing-certs
    - --secrets=aggregator-client
    - --secrets=etcd-client
    - --secrets=kubelet-client
    - --secrets=serving-cert
    command:
    - cluster-kube-apiserver-operator
    - installer
    image: quay.io/openshift-release-dev/ocp-v4.0@sha256:a38d0e240ab50573d5193eec7ecf6046b4b0860f8b2184d34c4ad648f020333e
    imagePullPolicy: Always
    name: installer
    resources: {}
    securityContext:
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/kubernetes/
      name: kubelet-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: installer-sa-token-scmj9
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: installer-sa-dockercfg-p4tf6
  nodeName: ip-10-0-23-167.ec2.internal
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext:
    runAsUser: 0
  serviceAccount: installer-sa
  serviceAccountName: installer-sa
  terminationGracePeriodSeconds: 30
  volumes:
  - hostPath:
      path: /etc/kubernetes/
      type: ""
    name: kubelet-dir
  - name: installer-sa-token-scmj9
    secret:
      defaultMode: 420
      secretName: installer-sa-token-scmj9
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-12-09T08:47:19Z
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-12-09T08:47:19Z
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2018-12-09T08:47:19Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://cfd8efde622ee86e28e89f58cb9d4850bbe2b1198d74b6937f4865ee013aa7ed
    image: quay.io/openshift-release-dev/ocp-v4.0@sha256:a38d0e240ab50573d5193eec7ecf6046b4b0860f8b2184d34c4ad648f020333e
    imageID: quay.io/openshift-release-dev/ocp-v4.0@sha256:a38d0e240ab50573d5193eec7ecf6046b4b0860f8b2184d34c4ad648f020333e
    lastState: {}
    name: installer
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: cri-o://cfd8efde622ee86e28e89f58cb9d4850bbe2b1198d74b6937f4865ee013aa7ed
        exitCode: 0
        finishedAt: 2018-12-09T08:47:22Z
        reason: OOMKilled
        startedAt: 2018-12-09T08:47:21Z
  hostIP: 10.0.23.167
  phase: Succeeded
  podIP: 10.128.0.21
  qosClass: BestEffort
  startTime: 2018-12-09T08:47:19Z

@wking (Member, Author) commented Dec 9, 2018

Also, it looks like 4.0.0-4 will be impacted by openshift/machine-config-operator#225, because:

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.0.0-4 --commits | grep dns-operator
  cluster-dns-operator    https://github.com/openshift/cluster-dns-operator    119e58d03d441282a764ba51619536f5a7c4ded8

As discussed in openshift/cluster-dns-operator#63, the /etc/hosts contention is from openshift/cluster-dns-operator#56. And:

$ git log --graph --oneline -8 119e58d03d441
*   119e58d Merge pull request #61 from ironcladlou/registry-hosts-fqdn
|\  
| * 7f9b291 Fix uninstall script
| * 3e6c3aa Support both relative and absolute service names
* |   9b78211 Merge pull request #59 from sosiouxme/patch-1
|\ \  
| |/  
|/|   
| * d2fedf1 Makefile: don't specify GOARCH
* |   03223f1 Merge pull request #60 from Miciah/fix-registry-service-name
|\ \  
| |/  
|/|   
| * 6c533d9 Fix registry service name
|/  
*   4c2aed1 Merge pull request #56 from pravisankar/reg-node-resolver
|\  

So until we fix that, we may be stuck on 4.0.0-3. On the other hand, the image-registry operator conflict @abhinavdahiya points out in openshift/machine-config-operator#225 is from way back in openshift/cluster-image-registry-operator#72. So maybe we just need to live with occasional MCD-dirty-file reboots until we get openshift/machine-config-operator#225 landed?
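
To double-check that the pinned cluster-dns-operator commit really contains the #56 merge, something like this works in a cluster-dns-operator checkout (4c2aed1 is the merge commit from the log above):

$ git merge-base --is-ancestor 4c2aed1 119e58d03d441 && echo 'contains #56'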

…e:4.0.0-4

That's the latest RHCOS release:

  $ curl -s https://releases-rhcos.svc.ci.openshift.org/storage/releases/maipo/builds.json | jq '{latest: .builds[0], timestamp}'
  {
    "latest": "47.198",
    "timestamp": "2018-12-08T23:13:22Z"
  }

And Clayton just pushed 4.0.0-alpha.0-2018-12-07-090414 to
quay.io/openshift-release-dev/ocp-release:4.0.0-4.  That's not the
most recent release, but it's the most-recent stable release ;).

Renaming OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE gets us CI testing
of the pinned release despite openshift/release@60007df2 (Use
RELEASE_IMAGE_LATEST for CVO payload, 2018-10-03,
openshift/release#1793).
@wking wking force-pushed the version-0.6.0-pins branch from d108128 to d54e597 on December 10, 2018 at 15:09
@crawford (Contributor) commented

/retest

@crawford crawford force-pushed the master-0.6.0 branch 3 times, most recently from 51a43d3 to 889b3c4 on December 10, 2018 at 16:44
@wking (Member, Author) commented Dec 10, 2018

We ended up pushing this out via a Git branch without bothering with this PR ;).

@wking wking closed this Dec 10, 2018
@wking wking deleted the version-0.6.0-pins branch December 11, 2018 07:12