
New Node creation, pull image from private repo fails "Forbidden" only for first 10-15 minutes of new node creation #3877

Closed
cdenneen opened this issue Nov 16, 2017 · 21 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Comments

@cdenneen

New node spins up with the following:

Failed to pull image "artifactserver.example.com/gitlab/gitlab-runner:v1.11.5": rpc error: code = 2 desc = Error response from daemon: {"message":"unknown: Forbidden"}
Error syncing pod

If I wait 10-15 minutes it eventually works.

However, if I log in to the node and run docker pull artifactserver.example.com/gitlab/gitlab-runner:v1.11.5, it pulls down with no issue, and the pod then starts within a few seconds on the next retry.

Basically, what I'm trying to understand is why there's a 10-15 minute delay when new nodes pull that image from the private registry, and why pulling it manually behaves any differently than pod creation does.

@justinsb
Member

Is the artifactserver a GCR / ECR server? Where are the credentials stored?

@cdenneen
Author

@justinsb the artifactserver is Artifactory, which is added as an insecureRegistry in the kind: Cluster configuration. Also, pulls don't require creds... just pushes.

@mikesplain
Contributor

@cdenneen What networking are you using? We ran into something similar with calico #3224.

@cdenneen
Author

@mikesplain that's it!!!! Have you found a solution? I haven't seen any traction on #3224 in a while. @chrislovecnm might know if this is being handled outside that issue?

@cdenneen
Author

@mikesplain should we switch to using something other than calico?

@mikesplain
Contributor

@cdenneen Glad to hear it! Well, it looks like we'll have a path forward soon based on #3224 (comment).

Anyway, my current workaround is a cleanup script that we schedule as a cronjob. Give me a few minutes and I'll open-source it.

@mikesplain
Contributor

@cdenneen Take a look at this. I haven't tested it directly, since I run it via a Helm chart, but it should help you out:

https://github.com/mikesplain/calico-clean

@cdenneen
Author

@mikesplain Thanks...
does the schedule have to be quoted?

work/capdev-kubernetes » kubectl create -f calico-clean.yaml
error: error converting YAML to JSON: yaml: line 8: did not find expected alphabetic or numeric character
work/capdev-kubernetes » cat -n calico-clean.yaml | grep -A2 -B2 ' 8'
     6	  labels:
     7	    role.kubernetes.io/networking: "1"
     8	spec:
     9	  schedule: */5 * * * *
    10	  concurrencyPolicy: Replace
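For reference: yes, the schedule has to be quoted. In YAML, an unquoted scalar beginning with `*` is parsed as an alias node, which is exactly the "did not find expected alphabetic or numeric character" error above on line 8's value. A minimal sketch of the fixed fragment (field names taken from the snippet above):

```yaml
spec:
  schedule: "*/5 * * * *"   # quoted so the leading * isn't parsed as a YAML alias
  concurrencyPolicy: Replace
```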

@cdenneen
Author

@mikesplain

So I got the cronjob installed, but I'm not able to find it using kubectl get cronjobs.

The API server is running with --runtime-config=batch/v2alpha1=true (had to figure that part out).

To load it I had to pass --validate=false... maybe I'm not waiting long enough.
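If it helps anyone later: rather than editing the API server flags by hand, that runtime config can be set declaratively in the kops cluster spec (a sketch; assumes a kops version that supports `runtimeConfig` under `kubeAPIServer`):

```yaml
# kops cluster spec fragment (edit via `kops edit cluster`):
# enables the batch/v2alpha1 API group needed for CronJob on this cluster version
spec:
  kubeAPIServer:
    runtimeConfig:
      batch/v2alpha1: "true"
```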

@mikesplain
Contributor

@cdenneen it's under the kube-system namespace. kubectl get cronjobs --namespace kube-system

I'm not positive this will solve your issue since you are getting some sort of response... hmm

@cdenneen
Author

Yeah, this might not be the same issue...
The issue I'm having is that a new node comes up and doesn't have the image for my StatefulSet.
The image pull fails with unknown: Forbidden:

Failed to pull image "artifactserver.example.com/gitlab/gitlab-runner:v1.11.5": rpc error: code = 2 desc = Error response from daemon: {"message":"unknown: Forbidden"}
Error syncing pod

I know it's not a connectivity or permission issue with the private repo, because if I log in to the node and run docker pull artifactserver.example.com/gitlab/gitlab-runner:v1.11.5 it pulls down without issue, and once it completes (usually by the time I refresh the Dashboard or run a get po) I can see the pods running.

@chrislovecnm
Contributor

A workaround is to add a hook that does an image pull, but that's only a workaround. Can you get us the kubelet logs?

I am guessing that this is an upstream issue btw. Anyone else agree / disagree?

@cdenneen
Author

2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc57504b499e   Pod                                 Normal    Scheduled               default-scheduler                       Successfully assigned runner-434cb7f1-project-103-concurrent-0wpc7m to ip-10-240-51-63.ec2.internal
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc57600fbcc5   Pod                                 Normal    SuccessfulMountVolume   kubelet, ip-10-240-51-63.ec2.internal   MountVolume.SetUp succeeded for volume "repo"
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc576052d1ed   Pod                                 Normal    SuccessfulMountVolume   kubelet, ip-10-240-51-63.ec2.internal   MountVolume.SetUp succeeded for volume "default-token-7b27b"
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc577f67ebcd   Pod       spec.containers{build}    Normal    Pulled                  kubelet, ip-10-240-51-63.ec2.internal   Container image "artifactserver.example.com/ruby:2.1.9" already present on machine
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc5782473e73   Pod       spec.containers{build}    Normal    Created                 kubelet, ip-10-240-51-63.ec2.internal   Created container
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc5789756acd   Pod       spec.containers{build}    Warning   Failed                  kubelet, ip-10-240-51-63.ec2.internal   Error: failed to start container "build": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:359: container init caused \\\\\\\"rootfs_linux.go:53: mounting \\\\\\\\\\\\\\\"/var/lib/kubelet/pods/3073c487-cbdd-11e7-9c9c-021e13f74eaa/volumes/kubernetes.io~empty-dir/repo\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/overlay/658bcf2ee80186f8257b8bbfa6811dd3466723b248f96a8ce89043c174575e5d/merged\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/overlay/658bcf2ee80186f8257b8bbfa6811dd3466723b248f96a8ce89043c174575e5d/merged/core\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"not a directory\\\\\\\\\\\\\\\"\\\\\\\"\\\"\\n\""}
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc57898fbcd6   Pod       spec.containers{helper}   Normal    Pulled                  kubelet, ip-10-240-51-63.ec2.internal   Container image "gitlab/gitlab-runner-helper:x86_64-cbfcb5c" already present on machine
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc578c5037b8   Pod       spec.containers{helper}   Normal    Created                 kubelet, ip-10-240-51-63.ec2.internal   Created container
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc5790124f02   Pod       spec.containers{helper}   Normal    Started                 kubelet, ip-10-240-51-63.ec2.internal   Started container
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc579013e078   Pod                                 Warning   FailedSync              kubelet, ip-10-240-51-63.ec2.internal   Error syncing pod
2m          2m           1         runner-434cb7f1-project-103-concurrent-0wpc7m.14f7fc5afe385268   Pod       spec.containers{helper}   Normal    Killing                 kubelet, ip-10-240-51-63.ec2.internal   Killing container with id docker://helper:Need to kill Pod

@cdenneen
Author

Here is the kubelet info from the node:

kubelet.log

@justinsb justinsb added this to the 1.8.0 milestone Nov 26, 2017
@cdenneen
Author

cdenneen commented Dec 5, 2017

Does anyone know how I can add some sort of hook to my kops nodes instance group to do the "docker pull" automatically, rather than logging in to each of these nodes to get past the delay?
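One way this could look (an untested sketch using the `hooks` field on a kops InstanceGroup; the unit name here is made up):

```yaml
# kops InstanceGroup fragment (`kops edit ig nodes`): a systemd oneshot hook
# that pre-pulls the image before kubelet starts.
# "prepull-runner.service" is a hypothetical unit name.
spec:
  hooks:
  - name: prepull-runner.service
    before:
    - kubelet.service
    manifest: |
      Type=oneshot
      ExecStart=/usr/bin/docker pull artifactserver.example.com/gitlab/gitlab-runner:v1.11.5
```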

@justinsb justinsb modified the milestones: 1.8.0, 1.9 Feb 21, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2018
@justinsb justinsb modified the milestones: 1.9.0, 1.10 May 26, 2018
@cdenneen
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 20, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
