[Failing Test] [sig-apps] ReplicaSet should serve a basic image on each replica with a private image, ReplicationController should serve a basic image on each replica with a private image #97002

Closed
thejoycekung opened this issue Dec 2, 2020 · 21 comments

@thejoycekung
Contributor

thejoycekung commented Dec 2, 2020

Which jobs are failing:
ci-kubernetes-e2e-gci-gce
ci-kubernetes-e2e-gce-cos-k8sbeta-default

Which test(s) are failing:
[sig-apps] ReplicaSet should serve a basic image on each replica with a private image
[sig-apps] ReplicationController should serve a basic image on each replica with a private image

Since when has it been failing:
Started failing between 2:04 and 2:40PM PST Dec 1

Testgrid link:
https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-cos-master-default
https://k8s-testgrid.appspot.com/sig-release-1.20-blocking#gce-cos-k8sbeta-default

Reason for failure:
Pods never run. Looks like both tests are timing out waiting for their containers to become ready.

ReplicaSet should serve a basic image on each replica with a private image:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/replica_set.go:98
Dec  1 22:54:13.321: Unexpected error:
    <*errors.errorString | 0xc0036f8ef0>: {
        s: "pod \"my-hostname-private-cd2ec0df-be38-465e-a00f-f868f9674320-rknrl\" never run (phase: Pending, conditions: [{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-cd2ec0df-be38-465e-a00f-f868f9674320]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-cd2ec0df-be38-465e-a00f-f868f9674320]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason: Message:}]): timed out waiting for the condition",
    }
    pod "my-hostname-private-cd2ec0df-be38-465e-a00f-f868f9674320-rknrl" never run (phase: Pending, conditions: [{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-cd2ec0df-be38-465e-a00f-f868f9674320]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-cd2ec0df-be38-465e-a00f-f868f9674320]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 22:49:07 +0000 UTC Reason: Message:}]): timed out waiting for the condition
occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/replica_set.go:156

ReplicationController should serve a basic image on each replica with a private image:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/rc.go:68
Dec  1 23:07:02.794: Unexpected error:
    <*errors.errorString | 0xc00348b1f0>: {
        s: "pod \"my-hostname-private-3071f600-7524-41d9-b7ea-f7a5cf5011e7-xz94v\" never run (phase: Pending, conditions: [{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-3071f600-7524-41d9-b7ea-f7a5cf5011e7]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-3071f600-7524-41d9-b7ea-f7a5cf5011e7]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason: Message:}]): timed out waiting for the condition",
    }
    pod "my-hostname-private-3071f600-7524-41d9-b7ea-f7a5cf5011e7-xz94v" never run (phase: Pending, conditions: [{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-3071f600-7524-41d9-b7ea-f7a5cf5011e7]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [my-hostname-private-3071f600-7524-41d9-b7ea-f7a5cf5011e7]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-12-01 23:02:02 +0000 UTC Reason: Message:}]): timed out waiting for the condition
occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/rc.go:459

Anything else we need to know:

Example Spyglass links:

Having trouble finding a good Triage link - will drop one if I can find one

Wondering whether this has anything to do with the Pod pending timeout errors happening on some of the jobs on the 1.20 boards now?

/sig apps
/cc @kubernetes/ci-signal @kubernetes/sig-apps-test-failures

@thejoycekung thejoycekung added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Dec 2, 2020
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Dec 2, 2020
@k8s-ci-robot
Contributor

@thejoycekung: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 2, 2020
@aojea
Member

aojea commented Dec 2, 2020

It seems this is the problem: `failed to resolve reference "gcr.io/k8s-authenticated-test/agnhost:2.6": failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden`

Dec 1 22:58:38.068: INFO: At 2020-12-01 22:53:38 +0000 UTC - event for my-hostname-private-1b815588-3c0e-49b2-bad4-77d652224eb8-tqctm: {kubelet bootstrap-e2e-minion-group-6s0c} Failed: Failed to pull image "gcr.io/k8s-authenticated-test/agnhost:2.6": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/k8s-authenticated-test/agnhost:2.6": failed to resolve reference "gcr.io/k8s-authenticated-test/agnhost:2.6": failed to authorize: failed to fetch oauth token: unexpected status: 403 Forbidden

@RobertKielty
Member

@thejoycekung
Contributor Author

Now also affecting:

  • ci-kubernetes-e2e-ubuntu-gce-containerd
  • ci-kubernetes-e2e-ubuntu-gce

@spiffxp
Member

spiffxp commented Dec 2, 2020

https://kubernetes.slack.com/archives/C09QZ4DQB/p1606896985218000

The project hosting the GCR repo was swept up by a security audit because it hadn't been properly accounted for. That change has been reverted. Now waiting to see affected jobs go back to green.

We should create a community-owned equivalent project; I'll open a follow-up issue for that

@spiffxp
Member

spiffxp commented Dec 2, 2020

/assign @BenTheElder @spiffxp
Since ownership of the project got transferred to us

@kubernetes/ci-signal I feel like this should be assigned to someone from CI Signal to track the jobs going green. What's your policy for that?

@justaugustus
Member

@spiffxp -- CI Signal should continue to monitor.

I think since @krzyzacy restored the project, y'all are in the clear for the time being. 🙃

/assign @justaugustus @hasheddan
(Dan and I will be watching from the shadows. :))

@justaugustus
Member

We should create a community-owned equivalent project; I'll open a follow-up issue for that

@spiffxp -- Opened one here: kubernetes/k8s.io#1458

@smarterclayton
Contributor

smarterclayton commented Dec 3, 2020

Hrm, I'm seeing this fail still in downstream repo tests. Are the tests injecting a secret (the only hardcoded GCR secret I see is in k8s.io/kubernetes/test/e2e/common/runtime.go but that is not called by those referenced tests), or is the auth rule on this repo limited to a set of projects now vs all projects on GCP before (since these tests passed in our GCP projects yesterday but not now, after access was supposedly restored)?

Who is able to access that repo? If it was previously "all projects" then I think that wasn't restored correctly. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/24887/pull-ci-openshift-origin-master-e2e-gcp/1334318300561149952 is a 1.19 codebase trying to run in the openshift-gce-devel-ci GCP project, but is getting access denied.

EDIT: This looks like it has started passing again at midnight EST? Maybe some sort of weird perms propagation issue. DISREGARD

@justaugustus
Member

I'll leave it to Ben or Aaron to report on what permissions are configured on the repo (as they now have access to it).

@hasheddan
Contributor

As best I can tell here, we aren't using any imagePullSecrets for the Pods:

ginkgo.It("should serve a basic image on each replica with a private image", func() {

So that likely means the service account we are mounting has the proper credentials attached.
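
For reference, here is a rough manual equivalent of the pull authorization (a sketch, not taken from the test code; it assumes gcloud is logged in as an account that is supposed to have read access, and uses tag 2.6 to match the failing pull above). On GCE nodes the kubelet's built-in GCR credential provider does this with the node's default service account token from the metadata server instead of gcloud:

$ # GCR accepts a Google OAuth2 access token via basic auth with user "oauth2accesstoken"
$ curl -sf -u "oauth2accesstoken:$(gcloud auth print-access-token)" \
    "https://gcr.io/v2/k8s-authenticated-test/agnhost/manifests/2.6" >/dev/null \
    && echo "registry auth OK" \
    || echo "registry auth failed (a 403 here matches the kubelet events)"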

@hasheddan
Contributor

@spiffxp I think we likely just need to add permissions to [email protected] to access the bucket in the restored project where the GCR images are hosted.
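
If that is indeed the fix, the grant would look roughly like this (a sketch only: the service account address is elided in the comment above, so SA_EMAIL is a placeholder, and the bucket name assumes the usual gcr.io convention of artifacts.<project>.appspot.com):

$ # Give the CI service account read access to the GCR backing bucket
$ gsutil iam ch "serviceAccount:SA_EMAIL:roles/storage.objectViewer" \
    gs://artifacts.k8s-authenticated-test.appspot.com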

@neolit123
Member

I confirm that these two tests are flaking a lot and blocking presubmits.
xref:
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce-ubuntu-containerd
#97016

@spiffxp
Member

spiffxp commented Dec 3, 2020

Now that I have access to the project, I'm working on restoring permissions. I had hoped this would be a 10min fix, but it's taking longer than I expected. I can currently list the backing bucket, but cannot list images.
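
For reference, the two checks behind that statement are roughly (the bucket name is my assumption from the standard gcr.io naming, not stated anywhere in this issue):

$ # Listing the backing bucket works...
$ gsutil ls gs://artifacts.k8s-authenticated-test.appspot.com
$ # ...while listing images through the registry API still fails at this point
$ gcloud container images list --repository=gcr.io/k8s-authenticated-test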

@spiffxp
Member

spiffxp commented Dec 3, 2020

In the event that I can't get this done expediently for release, I think the release team should consider some alternatives.

In order of my personal preference:

  1. disable the tests for 1.20/master by adding a [Feature:TemporarilyDisabled] tag (quickest/minimal change; see the sketch after this list)
  2. set up a new registry and migrate the tests to that (cherry-pick to all release branches; downstream users may remain broken)
  3. ignore the failures as known problems (now you're not releasing on green)
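
Sketch of how option 1 takes effect (described from general e2e conventions, not from this issue): the [Feature:TemporarilyDisabled] suffix goes into the It() text, and the blocking jobs already pass a ginkgo skip regex that excludes anything tagged [Feature:...]. The same mechanism can be used ad hoc when running the test binary directly:

$ # Skip the affected tests by name via ginkgo's standard skip regex
$ ./e2e.test --ginkgo.skip='should serve a basic image on each replica with a private image'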

@spiffxp
Member

spiffxp commented Dec 3, 2020

OK, thanks to some help from @amwat, I am now able to list images with my Google account:

$ gcloud container images list --repository=gcr.io/k8s-authenticated-test
NAME
gcr.io/k8s-authenticated-test/agnhost
gcr.io/k8s-authenticated-test/agnhost-amd64
gcr.io/k8s-authenticated-test/agnhost-arm
gcr.io/k8s-authenticated-test/agnhost-arm64
gcr.io/k8s-authenticated-test/agnhost-ppc64le
gcr.io/k8s-authenticated-test/agnhost-s390x
gcr.io/k8s-authenticated-test/serve-hostname
gcr.io/k8s-authenticated-test/serve-hostname-amd64
gcr.io/k8s-authenticated-test/serve-hostname-arm
gcr.io/k8s-authenticated-test/serve-hostname-arm64
gcr.io/k8s-authenticated-test/serve-hostname-ppc64le
gcr.io/k8s-authenticated-test/serve-hostname-s390x
gcr.io/k8s-authenticated-test/serve_hostname
gcr.io/k8s-authenticated-test/serve_hostname-amd64
gcr.io/k8s-authenticated-test/serve_hostname-arm
gcr.io/k8s-authenticated-test/serve_hostname-arm64
gcr.io/k8s-authenticated-test/serve_hostname-ppc64le
gcr.io/k8s-authenticated-test/serve_hostname-s390x

@spiffxp
Member

spiffxp commented Dec 3, 2020

There are no specific permissions on the bucket for a service account
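
A way to double-check that, as a sketch (the bucket name assumes the usual artifacts.<project>.appspot.com convention for gcr.io and is not stated in this issue):

$ # Inspect IAM bindings on the backing bucket and on the host project
$ gsutil iam get gs://artifacts.k8s-authenticated-test.appspot.com
$ gcloud projects get-iam-policy k8s-authenticated-test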

Without docker configured for auth:

$ docker pull gcr.io/k8s-authenticated-test/agnhost:2.5
Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

Configuring docker to use my personal gcloud account for auth (which has no explicit permissions to this registry):

$ gcloud config set account REDACTED
$ gcloud auth configure-docker
$ docker pull gcr.io/k8s-authenticated-test/agnhost:2.5
# ...
Digest: sha256:e9f40b11ae4fca95496f97daee5553e8786002dc4639251759aa934e8ea601d3
Status: Downloaded newer image for gcr.io/k8s-authenticated-test/agnhost:2.5

So this may be enough to unblock tests

@spiffxp
Member

spiffxp commented Dec 3, 2020

I'm starting to see passes for presubmits that were affected by this, looking at https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&job=pull-kubernetes-e2e-gce-ubuntu-containerd

e.g. https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/97020/pull-kubernetes-e2e-gce-ubuntu-containerd/1334589100509892608/

@spiffxp
Member

spiffxp commented Dec 4, 2020

I'm questioning whether we even want to keep these tests around (ref #97026 (comment)) but I don't think that needs to happen for v1.20.0

@thejoycekung
Contributor Author

thejoycekung commented Dec 10, 2020

Closing this one now since Testgrid has been green for the past week or so.
/close

@k8s-ci-robot
Contributor

@thejoycekung: Closing this issue.

In response to this:

closing this one now since testgrid has been green for the past few days

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
