
[ws-daemon] Offer services not just on a nodePort #2956

Closed
csweichel opened this issue Jan 18, 2021 · 25 comments
Labels
component: ws-daemon component: ws-manager meta: stale This issue/PR is stale and will be closed soon type: feature request New feature or request

Comments

@csweichel
Contributor

csweichel commented Jan 18, 2021

ws-manager talks to specific ws-daemon instances, depending on the workspace in question. So far we've been using a nodePort to make that happen. It turns out that we simply "got lucky": Calico just happens to allow pod-to-nodePort traffic. But that's non-standard and other CNIs don't support it, hence we must look for alternatives. Potential solutions are:

  1. ws-manager looks up ws-daemon: pods can talk to other pods if their NetworkPolicy allows it. Instead of relying on a nodePort, we could have ws-manager look for a ws-daemon running on the workspace pod's node, akin to what kubectl port-forward service/ws-daemon would do. This would be straightforward to implement, but would somewhat re-implement Kubernetes services (see the sketch below).
  2. service topology: allows more control over the routing of requests to services. We could use this to ensure that we talk to the correct ws-daemon using a Kubernetes service. Service topology is an alpha feature as of Kubernetes 1.17 though, which means we'd have to decide on [self-hosted] Platform Support Matrix #2657 first.
  3. hostNetwork: using the hostNetwork might be a possibility, if the nodes can talk to each other (by no means a given). This very much depends on the cluster setup, much like the current solution does. Also, it would give ws-daemon far more power than it currently has.

We'll go with option 1.
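For illustration, a rough sketch of what option 1 could look like with client-go. The label selector and function are illustrative only, not taken from the actual implementation:

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// findWsDaemonOnNode finds the ws-daemon pod running on the same node as a
// given workspace pod, so ws-manager can dial it pod-to-pod instead of via a
// nodePort. The "component=ws-daemon" label selector is an assumption.
func findWsDaemonOnNode(ctx context.Context, client kubernetes.Interface, namespace, nodeName string) (*corev1.Pod, error) {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "component=ws-daemon",
		// Only consider pods scheduled on the workspace's node.
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return nil, err
	}
	for i := range pods.Items {
		if pods.Items[i].Status.Phase == corev1.PodRunning {
			return &pods.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no running ws-daemon found on node %s", nodeName)
}
```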

@stefanfritsch

That's what I call a timely issue. I just had the issue 2 hours ago. :)

Correct me if I'm wrong but service topologies currently don't allow you to select by labels or such. You can just ensure that you end up on the same host.

So the workspace pod would have to call the ws-daemon using a service with "kubernetes.io/hostname" (to get the daemon on the same node) and then that ws-daemon would have to tell the manager that it's the correct one. This more or less sounds like solution 1.
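For illustration, a node-local Service built on the alpha ServiceTopology feature (available roughly Kubernetes 1.17–1.21) might look like the sketch below, here expressed with client-go types; the selector label and port are placeholders, not Gitpod's actual values. It can only keep traffic on the same node, which is exactly the limitation described above:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// nodeLocalWsDaemonService sketches a Service that only routes to endpoints on
// the caller's own node via the alpha ServiceTopology feature. It cannot tell
// the caller which ws-daemon it reached.
func nodeLocalWsDaemonService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "ws-daemon"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"component": "ws-daemon"}, // placeholder label
			Ports: []corev1.ServicePort{{
				Name:       "rpc",
				Port:       8080, // placeholder port
				TargetPort: intstr.FromInt(8080),
			}},
			// Only consider endpoints on the same node; no fallback.
			TopologyKeys: []string{"kubernetes.io/hostname"},
		},
	}
}
```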

@jgallucci32
Contributor

jgallucci32 commented Jan 18, 2021

I'm in favor of both options 1 and 3. I would caution against option 2, since K8s distros such as Rancher strip alpha features to avoid their use in production settings. OpenShift is the same way: by default you get only GA features and have to explicitly enable alpha/beta features through specific Red Hat channels.

@csweichel
Contributor Author

csweichel commented Jan 19, 2021

> Correct me if I'm wrong but service topologies currently don't allow you to select by labels or such. You can just ensure that you end up on the same host.
>
> So the workspace pod would have to call the ws-daemon using a service with "kubernetes.io/hostname" (to get the daemon on the same node) and then that ws-daemon would have to tell the manager that it's the correct one. This more or less sounds like solution 1.

Good point, even with option 2 when using "kubernetes.io/hostname" we'd have to ensure we got the right ws-daemon.

@csweichel csweichel added this to the February 2021 milestone Jan 19, 2021
@stefanfritsch

I don't know the internals but you can just watch spec.nodeName of

/api/v1/watch/namespaces/$GITPOD_NAMESPACE/pods?fieldselector=something-that-finds-ws-pods="true"

to get notified of node placements.
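For illustration only (not from the Gitpod codebase), the same watch expressed with client-go could look roughly like this; the label selector for workspace pods is a placeholder:

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchWorkspacePlacements watches workspace pods and reports the node each
// one lands on. The selector is a placeholder for whatever actually
// identifies workspace pods.
func watchWorkspacePlacements(ctx context.Context, client kubernetes.Interface, namespace string) error {
	w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: "component=workspace", // placeholder
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if pod, ok := ev.Object.(*corev1.Pod); ok && pod.Spec.NodeName != "" {
			fmt.Printf("%s: workspace pod %s is on node %s\n", ev.Type, pod.Name, pod.Spec.NodeName)
		}
	}
	return nil
}
```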

@yyexplore

Would you also please consider the registry-facade daemon, @csweichel? hostPort is not supported by some networks, and it is not secure either.

@jgallucci32
Contributor

jgallucci32 commented Jan 26, 2021

@yyexplore Is there a reported issue, or can you provide steps to reproduce a problem with registry-facade that would require hostPort as well? I have only seen an issue with ws-daemon requiring access to the host network using Canal on RKE. I've actually wondered myself why registry-facade is not causing an issue, given it has a similar network configuration to ws-daemon (and is the only other service which uses a host port).

@yyexplore

> @yyexplore Is there a reported issue, or can you provide steps to reproduce a problem with registry-facade that would require hostPort as well? I have only seen an issue with ws-daemon requiring access to the host network using Canal on RKE. I've actually wondered myself why registry-facade is not causing an issue, given it has a similar network configuration to ws-daemon (and is the only other service which uses a host port).

@jgallucci32, I didn't report a separate issue, but registry-facade should be in a similar situation to ws-daemon; the registry-facade-daemonset.yaml generated by the Helm chart shows:

        ports:
        - name: registry
          containerPort: 32223
          hostPort: 3000

@csweichel
Contributor Author

For registry facade there's no way around the hostPort. Registry-facade is consumed by the container runtime of the cluster, hence by a service outside of Kubernetes. AFAIK the only way to make something like that work is using a hostPort.

@jgallucci32
Contributor

The premise behind this was that, under certain conditions, ws-daemon requires access to the host network (hostNetwork: true) in order to function. This is not the case for registry-facade: it uses a hostPort and is fully functional without that setting.

Digging further, I see a difference between the two which might be triggering this:

registry-facade

securityContext:
  privileged: false
  runAsUser: 1000
...
hostPID: false

ws-daemon

securityContext:
  privileged: true
procMount: Default
...
hostPID: true

Another thing which confuses me is that you state "[so] far we've been using a nodePort to make [ws-daemon] happen". But as I look at my 0.6.0 deployment, ws-daemon has all 3 container ports mapped to a hostPort, the only delta being that I patched the daemonset to have hostNetwork: true to make it functional. Is something happening with the deployment where Helm is making a determination at runtime based on K8s, or is the premise misstated in this thread?

@csweichel
Contributor Author

I think we haven't done a particularly good job communicating what those components do, hence what they require:

  • registry-facade is what workspace pods pull their image from (in 0.6.0 when feature preview is enabled, always in 0.7.0 - that's probably why it's "working" for you without hostNetwork). This service is consumed by your cluster's container runtime, hence from outside the Kubernetes network. This service must run with hostNetwork: true to function, because services outside of the Kubernetes network need to talk to it.
  • ws-daemon is kind of like the kubelet. As you already pointed out, it's fairly privileged (hostPID, unmasked proc mount, runs as root). The only component to directly connect to ws-daemon over the network is ws-manager. ws-manager, however, needs to be able to talk to the ws-daemon that corresponds to a given workspace, i.e. the one running on the same node. To this end, we've relied on the hostIP and hostPorts (see the sketch below). The generous use of hostPorts in ws-daemon looks like an oversight - only one hostPort is required for ws-daemon to function.
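For illustration, the current hostIP/hostPort approach boils down to something like the following. This is not the actual ws-manager code; the port constant and names are placeholders, and whether the dial works at all depends on the CNI:

```go
package example

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	corev1 "k8s.io/api/core/v1"
)

// wsDaemonHostPort is a placeholder for the single hostPort ws-daemon exposes.
const wsDaemonHostPort = 8080

// dialWsDaemonForWorkspace dials the ws-daemon on the node that runs the given
// workspace pod, using the pod's status.hostIP and ws-daemon's hostPort. This
// only works if the CNI allows pod-to-host traffic, which is exactly the
// problem this issue is about.
func dialWsDaemonForWorkspace(ctx context.Context, workspacePod *corev1.Pod) (*grpc.ClientConn, error) {
	if workspacePod.Status.HostIP == "" {
		return nil, fmt.Errorf("workspace pod %s has no hostIP yet", workspacePod.Name)
	}
	addr := fmt.Sprintf("%s:%d", workspacePod.Status.HostIP, wsDaemonHostPort)
	dctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return grpc.DialContext(dctx, addr, grpc.WithInsecure(), grpc.WithBlock())
}
```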

@csweichel csweichel added the priority: 💪 stretch goal This issue is a stretch goal within an iteration. label Feb 9, 2021
@shiftyp

shiftyp commented Feb 9, 2021

I believe I'm experiencing a related issue when trying to create a workspace in a managed Kubernetes cluster. From my ws-manager logs:

{"instanceId":"[REDACTED]","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"[REDACTED]","metadata":{"owner":"[REDACTED]","meta_id":"[REDACTED]","started_at":{"seconds":1612880080}},"spec":{"workspace_image":"[REDACTED]/gitpod/workspace-images:373a0260c3c5f5b1735adf3939c58d08d02559df2240087f3c6023f99f883bf2","ide_image":"gcr.io/gitpod-io/self-hosted/theia-ide:0.6.0","url":"https://[REDACTED].ws.[REDACTED]","exposed_ports":[{"port":3000,"visibility":1,"url":"https://3000-[REDACTED].ws.[REDACTED]"}],"timeout":"30m"},"phase":5,"conditions":{"failed":"cannot initialize workspace: cannot connect to workspace daemon","service_exists":1,"deployed":1},"runtime":{"node_name":[REDACTED]}"},"auth":{"owner_token":"[REDACTED]"}},"time":"2021-02-09T14:14:52Z","userId":"[REDACTED]","workspaceId":"[REDACTED]"}

The entire ws-daemon DaemonSet is running / available on the same node, whose nodeName is specified in the error. (Ignore that bit, they are scheduled on different nodes). Let me know what further info I can provide.

@shiftyp

shiftyp commented Feb 9, 2021

My issue may be unrelated, and just a case of a misleading error message. I see this in one of the ws-daemon pods:

{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","ID":"38566085e59bbbaa1f7feb7f8990f5c5ad91573f8e465341ae7176991279b2db","containerImage":"","error":"not found\ngithub.com/containerd/containerd/errdefs.init\n\t/workspace/go/pkg/mod/github.com/containerd/[email protected]/errdefs/errors.go:45\nruntime.doInit\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:5625\nruntime.doInit\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:5620\nruntime.doInit\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:5620\nruntime.doInit\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:5620\nruntime.doInit\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:5620\nruntime.doInit\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:5620\nruntime.doInit\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:5620\nruntime.main\n\t/home/gitpod/sdk/go1.15/src/runtime/proc.go:191\nruntime.goexit\n\t/home/gitpod/sdk/go1.15/src/runtime/asm_amd64.s:1374\ncontainer \"38566085e59bbbaa1f7feb7f8990f5c5ad91573f8e465341ae7176991279b2db\" in namespace \"k8s.io\"\ngithub.com/containerd/containerd/errdefs.FromGRPC\n\t/workspace/go/pkg/mod/github.com/containerd/[email protected]/errdefs/grpc.go:107\ngithub.com/containerd/containerd.(*remoteContainers).Get\n\t/workspace/go/pkg/mod/github.com/containerd/[email protected]/containerstore.go:50\ngithub.com/gitpod-io/gitpod/ws-daemon/pkg/container.(*Containerd).handleContainerdEvent\n\t/tmp/build/components-ws-daemon--app.c9ba40007509fa738934dc59cd39b1b07b64fe8f/pkg/container/containerd.go:166\ngithub.com/gitpod-io/gitpod/ws-daemon/pkg/container.(*Containerd).start\n\t/tmp/build/components-ws-daemon--app.c9ba40007509fa738934dc59cd39b1b07b64fe8f/pkg/container/containerd.go:149\nruntime.goexit\n\t/home/gitpod/sdk/go1.15/src/runtime/asm_amd64.s:1374","level":"warning","message":"cannot find container we just received a create event for","serviceContext":{"service":"ws-daemon","version":""},"severity":"WARNING","time":"2021-02-09T16:21:28Z"}

This might be related to an issue I ran into with my install of the DaemonSet (on Linode Kubernetes) with /run/containerd/io.containerd.runtime.v1.linux/k8s.io not existing on the host node. I changed it to /run/containerd/io.containerd.runtime.v1.linux/moby, which does exist, but this may have broken something else (I only partially understand all this). Thanks, and let me know if I should file another issue.

@shiftyp

shiftyp commented Feb 9, 2021

I'm not sure the above is the issue. I changed the following line to read 'moby' instead of 'k8s.io', and now I no longer see the error in the daemon logs, but I still see the same error from ws-manager. Again though, I only partially understand any of this.

kubernetesNamespace = "k8s.io"

@csweichel csweichel removed this from the February 2021 milestone Mar 1, 2021
@csweichel csweichel removed the priority: 💪 stretch goal This issue is a stretch goal within an iteration. label Mar 1, 2021
@stefanfritsch

Btw, I wouldn't call this a feature request unless you specify Calico as a requirement. This keeps Gitpod from working at all on e.g. Weave, unless you pin ws-manager and ws-daemon to a single node.

@jawabuu

jawabuu commented Mar 16, 2021

Hey @csweichel
How can I work around this right now?
I'm unable to use hostNetwork or change my CNI.
I'm using version 0.6.0 deployed via helm on a k3s cluster.
My logs from a terminated ws- pod:

{"envvar":"SUPERVISOR_ADDR","level":"debug","message":"passing environment variable to IDE","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-03-16T04:38:13Z"}
{"envvar":["PATH","HOSTNAME","LANG","HOME","TRIGGER_REBUILD","APACHE_DOCROOT_IN_REPO","NGINX_DOCROOT_IN_REPO","TRIGGER_BREW_REBUILD","MANPATH","INFOPATH","HOMEBREW_NO_AUTO_UPDATE","GO_VERSION","GOPATH","GOROOT","GRADLE_USER_HOME","NODE_VERSION","GITPOD_WORKSPACE_ID","GITPOD_INSTANCE_ID","GITPOD_CLI_APITOKEN","GITPOD_THEIA_PORT","GITPOD_WORKSPACE_URL","GITPOD_GIT_USER_NAME","GITPOD_RESOLVED_EXTENSIONS","GITPOD_REPO_ROOT","GITPOD_HOST","THEIA_MINI_BROWSER_HOST_PATTERN","GITPOD_TASKS","GITPOD_INTERVAL","GITPOD_MEMORY","THEIA_WORKSPACE_ROOT","THEIA_WEBVIEW_EXTERNAL_ENDPOINT","GITPOD_GIT_USER_EMAIL","SUPERVISOR_ADDR"],"level":"debug","message":"passing environment variables to IDE","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-03-16T04:38:13Z"}
{"host":"gitpod.example.com","kind":"gitpod","level":"info","message":"registered new token","reuse":"REUSE_WHEN_POSSIBLE","scopes":{"function:closePort":{},"function:controlAdmission":{},"function:generateNewGitpodToken":{},"function:getLayout":{},"function:getLoggedInUser":{},"function:getOpenPorts":{},"function:getPortAuthenticationToken":{},"function:getToken":{},"function:getWorkspace":{},"function:getWorkspaceOwner":{},"function:getWorkspaceTimeout":{},"function:getWorkspaceUsers":{},"function:isWorkspaceOwner":{},"function:openPort":{},"function:sendHeartBeat":{},"function:setWorkspaceTimeout":{},"function:stopWorkspace":{},"function:storeLayout":{},"function:takeSnapshot":{},"resource:gitpodToken::*::create":{},"resource:snapshot::*::create/get":{},"resource:token::*::get":{},"resource:userStorage::*::create/get/update":{},"resource:workspace::aebe0ada-b976-464c-8fa7-57968cf01e68::get/update":{},"resource:workspaceInstance::fe1febf7-0159-4b7f-805a-5b3c512275b8::get/update/delete":{}},"serviceContext":{"service":"supervisor","version":""},"severity":"INFO","time":"2021-03-16T04:38:13Z"}
{"envvar":"SUPERVISOR_ADDR","level":"debug","message":"passing environment variable to IDE","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-03-16T04:38:13Z"}
{"envvar":["PATH","HOSTNAME","LANG","HOME","TRIGGER_REBUILD","APACHE_DOCROOT_IN_REPO","NGINX_DOCROOT_IN_REPO","TRIGGER_BREW_REBUILD","MANPATH","INFOPATH","HOMEBREW_NO_AUTO_UPDATE","GO_VERSION","GOPATH","GOROOT","GRADLE_USER_HOME","NODE_VERSION","GITPOD_WORKSPACE_ID","GITPOD_INSTANCE_ID","GITPOD_CLI_APITOKEN","GITPOD_THEIA_PORT","GITPOD_WORKSPACE_URL","GITPOD_GIT_USER_NAME","GITPOD_RESOLVED_EXTENSIONS","GITPOD_REPO_ROOT","GITPOD_HOST","THEIA_MINI_BROWSER_HOST_PATTERN","GITPOD_TASKS","GITPOD_INTERVAL","GITPOD_MEMORY","THEIA_WORKSPACE_ROOT","THEIA_WEBVIEW_EXTERNAL_ENDPOINT","GITPOD_GIT_USER_EMAIL","SUPERVISOR_ADDR"],"level":"debug","message":"passing environment variables to IDE","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-03-16T04:38:13Z"}
{"error":"open /workspace/.gitpod/content.json: no such file or directory","level":"info","message":"no content init descriptor found - not trying to run it","serviceContext":{"service":"supervisor","version":""},"severity":"INFO","time":"2021-03-16T04:38:13Z"}
{"args":["/workspace/gtp","--port","23000","--hostname","0.0.0.0"],"entrypoint":"/theia/node_modules/@gitpod/gitpod-ide/startup.sh","level":"info","message":"launching IDE","serviceContext":{"service":"supervisor","version":""},"severity":"INFO","time":"2021-03-16T04:38:13Z"}
{"envvar":"SUPERVISOR_ADDR","level":"debug","message":"passing environment variable to IDE","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-03-16T04:38:13Z"}
{"envvar":["PATH","HOSTNAME","LANG","HOME","TRIGGER_REBUILD","APACHE_DOCROOT_IN_REPO","NGINX_DOCROOT_IN_REPO","TRIGGER_BREW_REBUILD","MANPATH","INFOPATH","HOMEBREW_NO_AUTO_UPDATE","GO_VERSION","GOPATH","GOROOT","GRADLE_USER_HOME","NODE_VERSION","GITPOD_WORKSPACE_ID","GITPOD_INSTANCE_ID","GITPOD_CLI_APITOKEN","GITPOD_THEIA_PORT","GITPOD_WORKSPACE_URL","GITPOD_GIT_USER_NAME","GITPOD_RESOLVED_EXTENSIONS","GITPOD_REPO_ROOT","GITPOD_HOST","THEIA_MINI_BROWSER_HOST_PATTERN","GITPOD_TASKS","GITPOD_INTERVAL","GITPOD_MEMORY","THEIA_WORKSPACE_ROOT","THEIA_WEBVIEW_EXTERNAL_ENDPOINT","GITPOD_GIT_USER_EMAIL","SUPERVISOR_ADDR"],"level":"debug","message":"passing environment variables to IDE","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-03-16T04:38:13Z"}
{"error":"Get \"http://localhost:23000/\": dial tcp [::1]:23000: connect: connection refused","level":"info","message":"IDE is not ready yet","serviceContext":{"service":"supervisor","version":""},"severity":"INFO","time":"2021-03-16T04:38:13Z"}
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
{"level":"info","message":"connection was successfully established","serviceContext":{"service":"supervisor","version":""},"severity":"INFO","time":"2021-03-16T04:38:13Z","url":"wss://gitpod.example.com/api/v1"}
{"error":"Get \"http://localhost:23000/\": dial tcp [::1]:23000: connect: connection refused","level":"info","message":"IDE is not ready yet","serviceContext":{"service":"supervisor","version":""},"severity":"INFO","time":"2021-03-16T04:38:18Z"}

@jawabuu

jawabuu commented Mar 16, 2021

@stefanfritsch How do I restrict ws-manager and ws-daemon to a single node?

@csweichel
Contributor Author

@jawabuu: the workspace pod logs wouldn't show this particular issue. The "connection refused" you're seeing can be a regular part of a workspace starting up.

If you look at your ws-manager logs, you should see failed connection attempts towards ws-daemon there.

@jawabuu

jawabuu commented Mar 16, 2021

@csweichel You are correct. I did see that.

@jawabuu

jawabuu commented Mar 16, 2021

Is it possible to disable the hostPort for ws-daemon?
I could then try to use a Kubernetes service, although it seems that ws-manager must connect to the ws-daemon on a specific node?

@jawabuu

jawabuu commented Mar 16, 2021

@csweichel Alternatively, is the port that ws-manager uses to connect to ws-daemon configurable?

@stale

stale bot commented Jun 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Jun 14, 2021
@stefanfritsch

Is there a reason the rl-gitpod merge request is not getting merged? Apart from lack of time :)

@stale stale bot removed the meta: stale This issue/PR is stale and will be closed soon label Jun 14, 2021
@stale

stale bot commented Sep 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Sep 12, 2021
@dejan9393

We're having an issue with the ws-daemon on our RKE cluster - one of our worker nodes successfully schedules the pod and starts the container, but the other one is met with the error "1 node(s) didn't have free ports for the requested pod ports". I'm assuming this is due to hostPort: 8080. lsof, netstat, telnet, and any other method we've tried do not show anything running on port 8080 on that node. Any ideas?

@stale stale bot removed the meta: stale This issue/PR is stale and will be closed soon label Sep 17, 2021
@stale

stale bot commented Dec 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Dec 16, 2021
@stale stale bot closed this as completed Dec 26, 2021