E2E test images: httpd images failed to push to staging #20884
Comments
/assign @spiffxp |
In case the problem is with the k8s-staging-e2e-test-images project, I looked at other jobs that push to that
No jobs other than the httpd ones have run since agnhost last passed, so it's tough to isolate this to "the httpd jobs" vs. "all jobs pushing to this project" |
Nothing merged to https://github.com/kubernetes/test-infra/commits/master/config/jobs/image-pushing in between last pass / first fail |
Things merged to https://github.com/kubernetes/k8s.io that could have changed something:
|
I saw cluster-api-aws recently push an image, so using them as a known-good
bad="e2e-test-images"
good="cluster-api-aws"
for s in "${bad}" "${good}"; do
p="k8s-staging-${s}"
output="iam-${s}.txt"
(
for b in $(gsutil ls -p "${p}"); do
gsutil iam get "${b}" | jq 'del(.etag)'
done
gcloud --format=json projects get-iam-policy "${p}" | jq 'del(.etag)'
) > "${output}"
num=$(gcloud projects describe k8s-infra-prow-build "--format=value(projectNumber)")
sed -i.bak -e "s/${s}/subproject/g;s/${num}/123456789/g" "${output}"
rm -f "${output}.bak"
done
diff -y -W 100 iam-{${bad},${good}}.txt
no diff in gcs iam; the e2e project appears to have additional role bindings, but how would that restrict access?
|
retrying the gcb build of this image that worked last week https://console.cloud.google.com/cloud-build/builds;region=global/818e9010-29ad-4fa2-ab81-7547c6f1c0c8?project=k8s-staging-e2e-test-images |
that worked, which leads me to believe a change introduced by kubernetes/kubernetes#99030 is the culprit |
I wonder how. The changes in that PR affect Windows, since those changes are made into the |
I can't find a reference for this but I'm wondering if the docker registry api doesn't like that tag format? |
https://docs.docker.com/engine/reference/commandline/tag/#extended-description
which makes it sound like the new tag would be valid |
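As a rough way to check a candidate tag against the rules described on the linked `docker tag` page (ASCII letters, digits, underscores, periods and dashes; no leading period or dash; at most 128 characters), a minimal sketch might look like the following. The function name `is_valid_docker_tag` is hypothetical, and the sample string is the `_GIT_TAG` value from this thread, used only as a tag-like example; the exact tag introduced by the PR is not shown here.

```shell
# Hypothetical helper: succeeds if "$1" satisfies the documented tag rules.
# Runs in a subshell so the LC_ALL override does not leak to the caller.
is_valid_docker_tag() (
  LC_ALL=C
  case "$1" in
    # Reject: empty, leading dash or period, or any character outside the allowed set.
    '' | [-.]* | *[!A-Za-z0-9_.-]*) exit 1 ;;
  esac
  # Enforce the 128-character maximum.
  [ "${#1}" -le 128 ]
)

# Usage with a tag-like sample string.
if is_valid_docker_tag "v20210218-v1.21.0-alpha.3-197-g9e5fcc49ec5"; then
  echo "valid"
else
  echo "invalid"
fi
```

This only checks syntax; it says nothing about whether a given registry accepts the push.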
I was able to build and push |
Going to try messing with the source file contents and submitting builds manually
# get the source
mkdir -p gcb && cd gcb
gsutil cp gs://k8s-staging-e2e-test-images-gcb/source/1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz .
tar xvzf 1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz && rm 1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz
# make a change
vi test/images/httpd/VERSION
# try running a build with the change
tar -czf ../spiffxp-http-gcb.tgz *
gsutil cp ../spiffxp-http-gcb.tgz gs://k8s-staging-e2e-test-images-gcb/source/
gcloud builds submit \
--verbosity debug \
--config /Users/spiffxp/w/kubernetes/kubernetes/test/images/cloudbuild.yaml \
--substitutions _PULL_BASE_REF=master,_WHAT=httpd,_GIT_TAG=v20210218-v1.21.0-alpha.3-197-g9e5fcc49ec5 \
--project k8s-staging-e2e-test-images \
--gcs-log-dir gs://k8s-staging-e2e-test-images-gcb/logs \
--gcs-source-staging-dir gs://k8s-staging-e2e-test-images-gcb/source \
gs://k8s-staging-e2e-test-images-gcb/source/spiffxp-http-gcb.tgz |
Same result (same SHA as before)
|
trying a different image
yields a similar error
|
Using source file from "good" build (
Manually submitting with no changes works
Changing to push nginx with new version
yields a similar error
so... new tag, unchanged manifest = error... is this working as intended, or should we be allowing this? |
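To make the "new tag, unchanged manifest" case easier to spot when triaging, here is a small sketch that classifies a push given a listing of what the registry already holds. Everything in it is an assumption for illustration: the `classify_push` name, the listing format (one `<digest> <tag>` pair per line), and the placeholder digests. It does not talk to any registry.

```shell
# classify_push LISTING TAG DIGEST
# LISTING is a file with one "<digest> <tag>" pair per line (hypothetical format).
classify_push() {
  awk -v t="$2" -v d="$3" '
    $1 == d && $2 == t { same = 1 }   # this exact tag already points at this digest
    $1 == d            { seen = 1 }   # digest exists under some tag
    END {
      if (same)      print "already-pushed"
      else if (seen) print "new-tag-existing-manifest"
      else           print "new-manifest"
    }
  ' "$1"
}

# Usage with placeholder data.
listing=$(mktemp)
cat > "${listing}" <<'EOF'
sha256:aaaa 2.4.38-1
sha256:bbbb 2.4.39-1
EOF
classify_push "${listing}" "2.4.38-2" "sha256:aaaa"   # prints new-tag-existing-manifest
rm -f "${listing}"
```

The `new-tag-existing-manifest` result corresponds to the situation reported above as producing the error.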
https://cloud.google.com/container-registry/docs/access-control#permissions_and_roles my guess now is that |
/wg k8s-infra |
Per https://cloud.google.com/container-registry/docs/gcr-service-account, the GCR service account for staging projects today may be overprivileged:
Per https://cloud.google.com/container-registry/docs/access-control#permissions_and_roles, the following are required for "Push (write) and pull (read) images for existing registry hosts in a project"
The bucket permissions are for "Add a registry to a project by pushing the first image to the registry host", which I think the k8s-infra staging project script does. I used https://cloud.google.com/iam/docs/troubleshooting-access to gather permissions for different accounts against the gcs bucket backing gcr.io/k8s-staging-e2e-test-images (ref: https://gist.github.com/spiffxp/0110b663a229ebda1e805d92b23e646b)
Going to look at these in order:
|
Give GCR's SA
Run same build as #20884 (comment), same result
So removing that
|
Giving GCB's SA
The only other storage role with
Run same build as #20884 (comment), same result |
/remove-priority critical-urgent
If we don't need to solve, I'd rather wait for the fix to come to us. AFAICT the latest release of moby/buildkit still lacks the fix, as does
But I suspect it's coming soon:
The alternative would be someone taking on the work of using unreleased code. |
I'm kicking this off the wg-k8s-infra project board since it seems like no action required on our part |
/remove-area prow |
/sig release |
/area images
So like a PR to update this: test-infra/images/gcb-docker-gcloud/Dockerfile, lines 44 to 48 in 97239cd
And a subsequent PR to kubernetes/kubernetes to update cloudbuild.yaml |
Opened #22977 to take a crack at bumping buildx to the latest release to see if this fixes this issue |
Did this fail since then? I thought we solved this a few months ago. The issue was that the same SHA couldn't be pushed again, so we added a label in the images / dockerfiles. Haven't seen it fail since. |
This is what I'm trying to prove is fixed. We should never have had to do the workaround in the first place, but some part of buildx or its dependencies was written in a way that didn't work with GCR (among other container registries) |
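For readers unfamiliar with the label workaround mentioned above, the idea can be sketched roughly as follows: embedding the version in an image label changes the image config, so each version yields a different manifest digest even when the filesystem layers are identical. The helper name `version_label` and the label key `image_version` are hypothetical; the actual label used in the Dockerfiles is not shown in this thread.

```shell
# Hypothetical helper: print a Dockerfile LABEL instruction carrying the version.
# Assumes the version string contains no double quotes.
version_label() {
  printf 'LABEL image_version="%s"\n' "$1"
}

# Usage: print the instruction for a placeholder version.
version_label "1.0"
```

The printed line could be appended to a Dockerfile before building; the same effect is available at build time through `docker build --label`.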
/milestone v1.23 |
The issue was resolved some time ago. We have promoted httpd images since this was opened, and the last job was green as well: https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-test-images
/close |
@claudiubelu: Closing this issue. |
What happened:
The Image Builder postsubmit jobs post-kubernetes-push-e2e-httpd-test-images and post-kubernetes-push-e2e-new-httpd-test-images are failing with a 401 Unauthorized error while trying to push to gcr.io/k8s-staging-e2e-test-images.
What you expected to happen:
It should have been able to push the images.
How to reproduce it (as minimally and precisely as possible):
Rerun the jobs.
Please provide links to example occurrences, if any:
[1] https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-test-images
[2] https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-new-test-images
[3] https://testgrid.k8s.io/sig-testing-images#kubernetes-e2e-windows-servercore-cache
Anything else we need to know?:
Worth noting that the job passed on 2021.02.09, but failed on 2021.02.15. The prow job config is fine: running the k8s-staging-e2e-test-images.sh script that generated the job reveals no diff.
Additionally, on 2021.02.11 the kubernetes-e2e-windows-servercore-cache job passed [3], a job which is defined similarly to the other 2 jobs.