
Fix broken integration test TestRegularWorkspaceTasks #9575

Closed
2 tasks done
Tracked by #8799
jenting opened this issue Apr 27, 2022 · 20 comments · Fixed by #9960 or #9968

@jenting
Contributor

jenting commented Apr 27, 2022

Bug description

The integration test TestRegularWorkspaceTasks is broken, failing with the following errors:

=== RUN   TestRegularWorkspaceTasks
 === RUN   TestRegularWorkspaceTasks/ws-manager
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/init
     tasks_test.go:96: rpc error: code = Unavailable desc = connection closed
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/before
     tasks_test.go:96: rpc error: code = Unavailable desc = connection closed
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/command
     tasks_test.go:96: rpc error: code = Unavailable desc = connection closed
 --- FAIL: TestRegularWorkspaceTasks (23.75s)
     --- FAIL: TestRegularWorkspaceTasks/ws-manager (23.75s)
         --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks (23.75s)
             --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/init (9.68s)
             --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/before (7.01s)
             --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/command (7.06s)

The root cause may be the one analyzed in #8800 (comment).

Steps to reproduce

# Switch to preview environment
./dev/preview/install-k3s-kubeconfig.sh

# Run integration test
cd test
go test -v ./... \
   -kubeconfig=/home/gitpod/.kube/config \
   -namespace=default \
   -run=TestRegularWorkspaceTasks

Tasks

  • Fix supervisor test
  • Fix agent part which checks existence of files

Workspace affected

No response

Expected behavior

No response

Example repository

No response

Anything else?

No response

@jenting
Contributor Author

jenting commented Apr 27, 2022

In the current code, I encounter the certificate issue; waiting for PR #9553 to be merged.
We can revisit this issue once the certificate issue is addressed.

Related Slack Thread.

@jenting
Contributor Author

jenting commented May 3, 2022

The supervisor listens on port 22999 on the IPv6 wildcard address:

$ netstat -ntlp  | grep 22999
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp6       0      0 :::22999                :::*                    LISTEN      -

Port-forwarding on address 0.0.0.0, from local port 32999 to remote port 22999:

$ kubectl port-forward --address=0.0.0.0 pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 127.0.0.1:32999 -> 22999

Accessing the supervisor endpoint:

$ curl -XGET 'http://127.0.0.1:32999/_supervisor/v1/status/tasks'
curl: (52) Empty reply from server

$ k port-forward --address=0.0.0.0 pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 0.0.0.0:32999 -> 22999
Handling connection for 32999
E0503 09:36:56.344879   45463 portforward.go:406] an error occurred forwarding 32999 -> 22999: error forwarding port 22999 to pod 88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753, uid : failed to execute portforward in network namespace "/var/run/netns/cni-1d91c723-efc9-a8d1-ea28-513326b839ed": failed to connect to localhost:22999 inside namespace "88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753", IPv4: dial tcp4 127.0.0.1:22999: connect: connection refused IPv6 dial tcp6: address localhost: no suitable address found 
E0503 09:36:56.345241   45463 portforward.go:234] lost connection to pod

I tried changing the port-forward to --address=localhost:

$ kubectl port-forward --address=localhost pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 127.0.0.1:32999 -> 22999
Forwarding from [::1]:32999 -> 22999

The supervisor still listens on port 22999 on the IPv6 wildcard address, and kubectl opened port 32999 on both IPv4 and IPv6:

$ netstat -tnlp | grep 32999
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 10.0.5.2:32999          0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:32999         0.0.0.0:*               LISTEN      45942/kubectl       
tcp6       0      0 ::1:32999               :::*                    LISTEN      45942/kubectl

Accessing the supervisor endpoint:

$ curl -XGET 'http://127.0.0.1:32999/_supervisor/v1/status/tasks'
curl: (52) Empty reply from server
$ curl -XGET 'http://[::1]:32999/_supervisor/v1/status/tasks'
curl: (52) Empty reply from server

$ kubectl port-forward --address=localhost pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 127.0.0.1:32999 -> 22999
Forwarding from [::1]:32999 -> 22999
Handling connection for 32999
E0503 09:42:26.602006   46159 portforward.go:406] an error occurred forwarding 32999 -> 22999: error forwarding port 22999 to pod 88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753, uid : failed to execute portforward in network namespace "/var/run/netns/cni-1d91c723-efc9-a8d1-ea28-513326b839ed": failed to connect to localhost:22999 inside namespace "88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753", IPv4: dial tcp4 127.0.0.1:22999: connect: connection refused IPv6 dial tcp6: address localhost: no suitable address found 
E0503 09:42:26.602432   46159 portforward.go:234] lost connection to pod

@gitpod-io/engineering-workspace Any thoughts or comments?

@princerachit
Contributor

Hi, I am looking into this error. Will update this issue once I find something.

@princerachit
Contributor

princerachit commented May 4, 2022

Workspace logs

The workspace created through this test has the following logs. Notice the not connected to Gitpod server error in the last line.

...
...
...
Web UI available at http://localhost:23000/
[15:06:26] Extension host agent started.
{"ide":"IDE","level":"info","message":"IDE readiness took 5.307 seconds","serviceContext":{"service":"supervisor","version":"commit-56fe9e79016d84f0e5acff880b6b1fc18e3e9707"},"severity":"INFO","time":"2022-05-04T15:06:26Z"}
{"ide":"IDE","level":"info","message":"IDE is ready","serviceContext":{"service":"supervisor","version":"commit-56fe9e79016d84f0e5acff880b6b1fc18e3e9707"},"severity":"INFO","time":"2022-05-04T15:06:26Z"}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","error":"not connected to Gitpod server","level":"error","message":"error tracking supervisor_readiness","serviceContext":{"service":"supervisor","version":"commit-56fe9e79016d84f0e5acff880b6b1fc18e3e9707"},"severity":"ERROR","time":"2022-05-04T15:06:26Z"}

@princerachit
Contributor

princerachit commented May 4, 2022

The port-forward will always fail for the supervisor port. When kubectl port-forwards, it does so in the context of a network namespace, namely the one originally created for the pod. This is the same network namespace in which the ./workspacekit ring0 command (the entrypoint) was run. workspacekit creates a new set of namespaces in ring1, which in turn creates another set of namespaces in ring2; this final set of namespaces is where the supervisor runs (see architecture).

Therefore, kubectl does not have access to the namespace where the supervisor is running, hence the error.

Proposed solution

  1. Modify the api.Supervisor method to port-forward from the correct network namespace (PREFERABLE); or
  2. Find another way to check the status of the tasks

@jenting

correct me if I am wrong @gitpod-io/engineering-workspace

@jenting
Contributor Author

jenting commented May 5, 2022

The port-forward will always fail for the supervisor port. When kubectl port-forwards, it does so in the context of a network namespace, namely the one originally created for the pod. This is the same network namespace in which the ./workspacekit ring0 command (the entrypoint) was run. workspacekit creates a new set of namespaces in ring1, which in turn creates another set of namespaces in ring2; this final set of namespaces is where the supervisor runs (see architecture).

Therefore, kubectl does not have access to the namespace where the supervisor is running, hence the error.

Proposed solution

  1. Modify the api.Supervisor method to port-forward from the correct network namespace (PREFERABLE); or
  2. Find another way to check the status of the tasks

@jenting

correct me if I am wrong @gitpod-io/engineering-workspace

Thanks for the analysis, Prince.
I'd prefer solution 1, but how do we get the supervisor's original network namespace?

@princerachit
Contributor

The kubectl CLI does not have any option for port-forward to choose the namespace (container namespace, not Kubernetes namespace) in which the port forwarding should run, nor does the kubelet code take such a configuration parameter.

One way to solve this could be to run an exec command through kubectl in order to expose the port. I still need to figure out how to do this and whether it is feasible.

Another way is to make use of the integration-test agent to enter the relevant namespace and then run a curl against the supervisor endpoint, as sketched below.
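
A minimal sketch of that second option, assuming the agent runs with enough privilege inside the pod and that supervisorPID has already been discovered elsewhere (e.g. by scanning /proc): join the supervisor's network namespace via setns(2) and issue a plain HTTP request against port 22999. setns only affects the calling thread, so the goroutine is pinned to its OS thread, and a literal address is dialed so the socket is created synchronously inside the joined namespace.

package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"runtime"

	"golang.org/x/sys/unix"
)

func curlSupervisorInNetns(supervisorPID int) error {
	// setns(2) applies per thread; keep this goroutine on one OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Join the network namespace the supervisor actually listens in
	// (ring2), not the pod's original namespace (ring0).
	nsFile, err := os.Open(fmt.Sprintf("/proc/%d/ns/net", supervisorPID))
	if err != nil {
		return err
	}
	defer nsFile.Close()
	if err := unix.Setns(int(nsFile.Fd()), unix.CLONE_NEWNET); err != nil {
		return err
	}

	// The supervisor listens on the IPv6 wildcard (:::22999); dialing a
	// literal address keeps socket creation on the locked thread.
	conn, err := net.Dial("tcp", "[::1]:22999")
	if err != nil {
		return err
	}
	defer conn.Close()

	fmt.Fprint(conn, "GET /_supervisor/v1/status/tasks HTTP/1.0\r\nHost: localhost\r\n\r\n")
	_, err = io.Copy(os.Stdout, conn)
	return err
}

func main() {
	// Hypothetical PID for illustration; the agent would discover it.
	if err := curlSupervisorInNetns(1234); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}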

@princerachit princerachit assigned princerachit and unassigned jenting May 9, 2022
@csweichel
Contributor

Could we not get the owner token of the workspace and talk to the supervisor API using the IDE URL?
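
A hypothetical sketch of this suggestion: query the supervisor status endpoint through the public workspace (IDE) URL instead of port-forwarding, presenting the workspace owner token. The URL shape and the x-gitpod-owner-token header are assumptions here, not confirmed in this thread; the owner token would come from ws-manager's workspace description.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// checkTasksViaIDEURL asks supervisor for the task status through the
// workspace's public URL, authenticated with the owner token.
func checkTasksViaIDEURL(workspaceURL, ownerToken string) error {
	req, err := http.NewRequest("GET", workspaceURL+"/_supervisor/v1/status/tasks", nil)
	if err != nil {
		return err
	}
	// Assumed auth mechanism: ws-proxy admits requests carrying the
	// workspace owner token in this (hypothetical) header.
	req.Header.Set("x-gitpod-owner-token", ownerToken)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	fmt.Println(resp.Status)
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	// Placeholder URL and token for illustration only.
	if err := checkTasksViaIDEURL("https://<workspace>.ws.gitpod.io", "<owner-token>"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}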

@princerachit
Contributor

Thanks @csweichel. That sounds like the easier way to check the supervisor status. I will use this approach.

I disabled the supervisor bit and proceeded to check the agent instrumentation part, which checks whether the files from the init tasks are created. I saw an Unauthorized error while the agent binary was being copied to the workspace pod. I am debugging this further.

@kylos101 kylos101 moved this from Blocked to In Progress in 🌌 Workspace Team May 9, 2022
@utam0k
Contributor

utam0k commented May 9, 2022

Hi, this is the workspace network architecture. It may help your investigation.

          Pod Network Namespace(ring1)
+------------------------------------------------+
|                                                |
|       Workspace Network Namespace(ring2)       |
| +--------------------------------------------+ |
| |                                            | |
| |              default via veth0             | |
| |                                            | |
| |                                            | |
| |     +------+  +--------------+             | |
| |     |  lo  |  |    ceth0     | 10.0.2.2/24 | |
| |     +------+  +--^--------+--+             | |
| |                  |        |                | |
| +------------------+--------+----------------+ |
|                    |        |                  |
|                 +--+--------v--+               |
|   +-----------> |    veth0     | 10.0.2.1/24   |
|   |             +-----------+--+               |
|   |                         |                  |
|   |          +--------------v-----+            |
|   |          |                    |            |
|   |          |      nftables      |            |
|   |          |   (ip masquerade)  |            |
|   |          +--------------+-----+            |
|   |                         |                  |
|   |   +------+  +-----------v--+               |
|   |   |  lo  |  |     eth0     |               |
|   |   +------+  +--^--------+--+               |
|   |                |        |                  |
|   |          +-----+--------v-----+            |
|   |          |                    |            |
|   +----------+      nftables      |            |
| if with port | (port redirecter)  |            |
|              +-----^--------+-----+            |
|                    |        |                  |
+--------------------+--------+------------------+
                     |        |
                     |        |
                     |        v
                    o u t s i d e

@jenting
Contributor Author

jenting commented May 10, 2022

I disabled the supervisor bit and proceeded to check the agent instrumentation part, which checks whether the files from the init tasks are created. I saw an Unauthorized error while the agent binary was being copied to the workspace pod. I am debugging this further.

I encountered this as well. Debugging...

@jenting jenting closed this as completed May 10, 2022
@jenting jenting reopened this May 10, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team May 10, 2022
@jenting
Contributor Author

jenting commented May 11, 2022

Thanks @csweichel. That sounds like the easier way to check the supervisor status. I will use this approach.

I disabled the supervisor bit and proceeded to check the agent instrumentation part, which checks whether the files from the init tasks are created. I saw an Unauthorized error while the agent binary was being copied to the workspace pod. I am debugging this further.

If we switch back to the core-dev env and run TestRegularWorkspaceTasks against it, there is no more Unauthorized error. Perhaps it's an environment issue rather than a code-level issue.

@princerachit
Contributor

princerachit commented May 11, 2022

Thanks for your remark @jenting. I tried looking at the API server logs for the k3s installation of the preview env but could not see any errors pertaining to the calls that we make in our test.

@gitpod-io/platform would you know why this would happen? I can work together with you to triage the problem.

About using the IDE URL: this does not work out of the box and needs more work. The call to the supervisor using the IDE URL fails with the login page. I will find a workaround for this; I might have missed something. Rechecking.

@princerachit
Contributor

princerachit commented May 11, 2022

The Unauthorized error is indeed affecting other tests as well, e.g.:

func TestBackup(t *testing.T) {
	f := features.New("backup").

=== RUN   TestBackup
=== RUN   TestBackup/backup
=== RUN   TestBackup/backup/it_should_start_a_workspace,_create_a_file_and_successfully_create_a_backup
    content_test.go:50: Could not run copy operation: Unauthorized
--- FAIL: TestBackup (75.36s)
    --- FAIL: TestBackup/backup (75.36s)
        --- FAIL: TestBackup/backup/it_should_start_a_workspace,_create_a_file_and_successfully_create_a_backup (75.35s)

@kylos101 kylos101 moved this from Done to In Progress in 🌌 Workspace Team May 12, 2022
@kylos101
Contributor

@princerachit this was marked as done; moving it back to in-progress. Not sure if you meant to mark it as done.

@jenting
Contributor Author

jenting commented May 12, 2022

@princerachit this was marked as done; moving it back to in-progress. Not sure if you meant to mark it as done.

I think it's because I closed this issue by accident 😅 (clicked the wrong button) and forgot to move it back to in-progress status. Sorry about that.

@jenting
Contributor Author

jenting commented May 12, 2022

@princerachit If I comment out this line, the integration test passes on the preview environment without the Unauthorized error.

So your assumption is correct: there is some difference between the k3s and GKE clusters.

@princerachit
Contributor

@princerachit If I comment out this line, the integration test passes on the preview environment without the Unauthorized error.

So your assumption is correct: there is some difference between the k3s and GKE clusters.

I suspect this is because we have different methods of authorization: with core-dev we use an access token to authorize, but with k3s we use client-certificate-data and client-key-data.

@princerachit
Contributor

So here is the issue: the rest.Config object being passed to this function can have either an access token for auth or a TLSClientConfig containing the client-certificate-data and client-key-data.

When the method is an access token, we can override the TLSClientConfig with connection info. However, when the method is a client certificate and key, overwriting the TLSClientConfig removes the existing credentials. Thus, we see the Unauthorized error:

config.TLSClientConfig = rest.TLSClientConfig{Insecure: true}
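
A minimal sketch, using k8s.io/client-go/rest types, of why that overwrite breaks client-certificate auth; the host and the certPEM/keyPEM variables are placeholders standing in for the kubeconfig fields:

package main

import "k8s.io/client-go/rest"

func brokenConfig(certPEM, keyPEM []byte) *rest.Config {
	cfg := &rest.Config{
		Host: "https://127.0.0.1:6443", // placeholder k3s API server
		TLSClientConfig: rest.TLSClientConfig{
			CertData: certPEM, // client-certificate-data from the kubeconfig
			KeyData:  keyPEM,  // client-key-data from the kubeconfig
		},
	}

	// Replacing the whole struct zeroes CertData and KeyData, so every
	// request reaches the kube-apiserver without client credentials and
	// is rejected as Unauthorized. With token auth this is harmless,
	// because the bearer token lives elsewhere on rest.Config.
	cfg.TLSClientConfig = rest.TLSClientConfig{Insecure: true}
	return cfg
}

func main() {
	_ = brokenConfig(nil, nil)
}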

@jenting
Contributor Author

jenting commented May 12, 2022

We could delete the line

config.TLSClientConfig = rest.TLSClientConfig{Insecure: true}

because we are able to use a secure TLS config to interact with the kube-apiserver on the core-dev env as well as the preview env.

Repository owner moved this from In Progress to Done in 🌌 Workspace Team May 12, 2022
@princerachit princerachit reopened this May 12, 2022
@kylos101 kylos101 moved this from Done to In Progress in 🌌 Workspace Team May 12, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team May 12, 2022