
Fix broken integration test TestRegularWorkspaceTasks #9575

Closed
2 tasks done
Tracked by #8799
jenting opened this issue Apr 27, 2022 · 20 comments · Fixed by #9960 or #9968

@jenting
Contributor

jenting commented Apr 27, 2022

Bug description

The integration test TestRegularWorkspaceTasks is broken, failing with the following errors:

=== RUN   TestRegularWorkspaceTasks
 === RUN   TestRegularWorkspaceTasks/ws-manager
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/init
     tasks_test.go:96: rpc error: code = Unavailable desc = connection closed
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/before
     tasks_test.go:96: rpc error: code = Unavailable desc = connection closed
 === RUN   TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/command
     tasks_test.go:96: rpc error: code = Unavailable desc = connection closed
 --- FAIL: TestRegularWorkspaceTasks (23.75s)
     --- FAIL: TestRegularWorkspaceTasks/ws-manager (23.75s)
         --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks (23.75s)
             --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/init (9.68s)
             --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/before (7.01s)
             --- FAIL: TestRegularWorkspaceTasks/ws-manager/it_can_run_workspace_tasks/command (7.06s)

The root cause may be the one analyzed in #8800 (comment).

Steps to reproduce

# Switch to preview environment
./dev/preview/install-k3s-kubeconfig.sh

# Run integration test
cd test
go test -v ./... \
   -kubeconfig=/home/gitpod/.kube/config \
   -namespace=default \
   -run=TestRegularWorkspaceTasks

Tasks

  • Fix supervisor test
  • Fix agent part which checks existence of files

Workspace affected

No response

Expected behavior

No response

Example repository

No response

Anything else?

No response

@jenting
Contributor Author

jenting commented Apr 27, 2022

In the current code, I encounter the certificate issue; waiting for PR #9553 to be merged.
We can revisit this issue once the certificate issue is addressed.

Related Slack Thread.

@jenting
Contributor Author

jenting commented May 3, 2022

The supervisor listens on port 22999 on the IPv6 wildcard address:

$ netstat -ntlp  | grep 22999
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp6       0      0 :::22999                :::*                    LISTEN      -

Port-forwarding on address 0.0.0.0, from local port 32999 to remote port 22999:

$ kubectl port-forward --address=0.0.0.0 pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 127.0.0.1:32999 -> 22999

Accessing the supervisor endpoint:

$ curl -XGET 'http://127.0.0.1:32999/_supervisor/v1/status/tasks'
curl: (52) Empty reply from server

$ k port-forward --address=0.0.0.0 pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 0.0.0.0:32999 -> 22999
Handling connection for 32999
E0503 09:36:56.344879   45463 portforward.go:406] an error occurred forwarding 32999 -> 22999: error forwarding port 22999 to pod 88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753, uid : failed to execute portforward in network namespace "/var/run/netns/cni-1d91c723-efc9-a8d1-ea28-513326b839ed": failed to connect to localhost:22999 inside namespace "88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753", IPv4: dial tcp4 127.0.0.1:22999: connect: connection refused IPv6 dial tcp6: address localhost: no suitable address found 
E0503 09:36:56.345241   45463 portforward.go:234] lost connection to pod

I tried changing the port-forward to --address=localhost:

$ kubectl port-forward --address=localhost pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 127.0.0.1:32999 -> 22999
Forwarding from [::1]:32999 -> 22999

The supervisor still listens on port 22999 on the IPv6 wildcard address, and kubectl opened port 32999 on both IPv4 and IPv6:

$ netstat -tnlp | grep 32999
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 10.0.5.2:32999          0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:32999         0.0.0.0:*               LISTEN      45942/kubectl       
tcp6       0      0 ::1:32999               :::*                    LISTEN      45942/kubectl

Accessing the supervisor endpoint:

$ curl -XGET 'http://127.0.0.1:32999/_supervisor/v1/status/tasks'
curl: (52) Empty reply from server
$ curl -XGET 'http://[::1]:32999/_supervisor/v1/status/tasks'
curl: (52) Empty reply from server

$ kubectl port-forward --address=localhost pod/ws-87285dbc-7400-4afb-85e8-a6de243cfe5c 32999:22999
Forwarding from 127.0.0.1:32999 -> 22999
Forwarding from [::1]:32999 -> 22999
Handling connection for 32999
E0503 09:42:26.602006   46159 portforward.go:406] an error occurred forwarding 32999 -> 22999: error forwarding port 22999 to pod 88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753, uid : failed to execute portforward in network namespace "/var/run/netns/cni-1d91c723-efc9-a8d1-ea28-513326b839ed": failed to connect to localhost:22999 inside namespace "88503d150548aaf02c1754d273767d2d9c0ffd1ad661a8819218ac91ad236753", IPv4: dial tcp4 127.0.0.1:22999: connect: connection refused IPv6 dial tcp6: address localhost: no suitable address found 
E0503 09:42:26.602432   46159 portforward.go:234] lost connection to pod

@gitpod-io/engineering-workspace Any thoughts or comments?

@princerachit
Contributor

Hi, I am looking into this error. Will update this issue once I find something.

@princerachit
Contributor

princerachit commented May 4, 2022

Workspace logs

The workspace created through this test has the following logs. Notice the not connected to Gitpod server error in the last line.

...
...
...
Web UI available at http://localhost:23000/
[15:06:26] Extension host agent started.
{"ide":"IDE","level":"info","message":"IDE readiness took 5.307 seconds","serviceContext":{"service":"supervisor","version":"commit-56fe9e79016d84f0e5acff880b6b1fc18e3e9707"},"severity":"INFO","time":"2022-05-04T15:06:26Z"}
{"ide":"IDE","level":"info","message":"IDE is ready","serviceContext":{"service":"supervisor","version":"commit-56fe9e79016d84f0e5acff880b6b1fc18e3e9707"},"severity":"INFO","time":"2022-05-04T15:06:26Z"}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","error":"not connected to Gitpod server","level":"error","message":"error tracking supervisor_readiness","serviceContext":{"service":"supervisor","version":"commit-56fe9e79016d84f0e5acff880b6b1fc18e3e9707"},"severity":"ERROR","time":"2022-05-04T15:06:26Z"}

@princerachit
Contributor

princerachit commented May 4, 2022

The port-forward will always fail for the supervisor port. When kubectl port-forwards, it does so in the context of a network namespace, namely the one originally created for the pod. This is the same network namespace in which the ./workspacekit ring0 command (the entrypoint) was run. workspacekit creates a new set of namespaces in ring1, which in turn creates another set of namespaces in ring2; this final set of namespaces is where the supervisor runs (see architecture).

Therefore, kubectl does not have access to the namespace where the supervisor is running, hence the error.

Proposed solution

  1. Modify the api.Supervisor method to port-forward from the correct network namespace (PREFERABLE); or
  2. Find another way to check the status of the tasks

@jenting

correct me if I am wrong @gitpod-io/engineering-workspace

@jenting
Contributor Author

jenting commented May 5, 2022

The port-forward will always fail for the supervisor port. When kubectl port-forwards, it does so in the context of a network namespace, namely the one originally created for the pod. This is the same network namespace in which the ./workspacekit ring0 command (the entrypoint) was run. workspacekit creates a new set of namespaces in ring1, which in turn creates another set of namespaces in ring2; this final set of namespaces is where the supervisor runs (see architecture).

Therefore, kubectl does not have access to the namespace where the supervisor is running, hence the error.

Proposed solution

  1. Modify the api.Supervisor method to port-forward from the correct network namespace (PREFERABLE); or
  2. Find another way to check the status of the tasks

@jenting

correct me if I am wrong @gitpod-io/engineering-workspace

Thanks for the analysis, Prince.
I'd prefer solution 1, but how do we get the supervisor's original network namespace?

@princerachit
Contributor

The kubectl CLI does not have any option for port-forward to choose the namespace (container namespace, not Kubernetes namespace) in which the port forwarding should run, nor does the kubelet code take such a configuration parameter.

One way to solve this could be to run an exec command through kubectl in order to expose the port. I still need to figure out how to do this and whether it is feasible.

Another way is to make use of the integration-test agent to enter the relevant namespace and then run a curl against the supervisor endpoint, as sketched below.
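
A minimal sketch of that second option, assuming the agent runs with enough privilege inside the pod and that supervisorPID has already been discovered elsewhere (e.g. by scanning /proc): join the supervisor's network namespace via setns(2) and issue a plain HTTP request against port 22999. setns only affects the calling thread, so the goroutine is pinned to its OS thread, and a literal address is dialed so the socket is created synchronously inside the joined namespace.

package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"runtime"

	"golang.org/x/sys/unix"
)

func curlSupervisorInNetns(supervisorPID int) error {
	// setns(2) applies per thread; keep this goroutine on one OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Join the network namespace the supervisor actually listens in
	// (ring2), not the pod's original namespace (ring0).
	nsFile, err := os.Open(fmt.Sprintf("/proc/%d/ns/net", supervisorPID))
	if err != nil {
		return err
	}
	defer nsFile.Close()
	if err := unix.Setns(int(nsFile.Fd()), unix.CLONE_NEWNET); err != nil {
		return err
	}

	// The supervisor listens on the IPv6 wildcard (:::22999); dialing a
	// literal address keeps socket creation on the locked thread.
	conn, err := net.Dial("tcp", "[::1]:22999")
	if err != nil {
		return err
	}
	defer conn.Close()

	fmt.Fprint(conn, "GET /_supervisor/v1/status/tasks HTTP/1.0\r\nHost: localhost\r\n\r\n")
	_, err = io.Copy(os.Stdout, conn)
	return err
}

func main() {
	// Hypothetical PID for illustration; the agent would discover it.
	if err := curlSupervisorInNetns(1234); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}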

@princerachit princerachit assigned princerachit and unassigned jenting May 9, 2022
@csweichel
Contributor

Could we not get the owner token of the workspace and talk to the supervisor API using the IDE URL?
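
A hypothetical sketch of this suggestion: query the supervisor status endpoint through the public workspace (IDE) URL instead of port-forwarding, presenting the workspace owner token. The URL shape and the x-gitpod-owner-token header are assumptions here, not confirmed in this thread; the owner token would come from ws-manager's workspace description.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// checkTasksViaIDEURL asks supervisor for the task status through the
// workspace's public URL, authenticated with the owner token.
func checkTasksViaIDEURL(workspaceURL, ownerToken string) error {
	req, err := http.NewRequest("GET", workspaceURL+"/_supervisor/v1/status/tasks", nil)
	if err != nil {
		return err
	}
	// Assumed auth mechanism: ws-proxy admits requests carrying the
	// workspace owner token in this (hypothetical) header.
	req.Header.Set("x-gitpod-owner-token", ownerToken)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	fmt.Println(resp.Status)
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	// Placeholder URL and token for illustration only.
	if err := checkTasksViaIDEURL("https://<workspace>.ws.gitpod.io", "<owner-token>"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}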

@princerachit
Contributor

Thanks @csweichel. That sounds like the easier way to check the supervisor status. I will use this approach.

I disabled the supervisor bit and proceeded to check the agent instrumentation part, which checks whether the files from the init tasks are created. I saw an Unauthorized error while the agent binary was being copied to the workspace pod. I am debugging this further.

@kylos101 kylos101 moved this from Blocked to In Progress in 🌌 Workspace Team May 9, 2022
@utam0k
Contributor

utam0k commented May 9, 2022

Hi, this is the workspace network architecture. It may help your investigation.

          Pod Network Namespace(ring1)
+------------------------------------------------+
|                                                |
|       Workspace Network Namespace(ring2)       |
| +--------------------------------------------+ |
| |                                            | |
| |              default via veth0             | |
| |                                            | |
| |                                            | |
| |     +------+  +--------------+             | |
| |     |  lo  |  |    ceth0     | 10.0.2.2/24 | |
| |     +------+  +--^--------+--+             | |
| |                  |        |                | |
| +------------------+--------+----------------+ |
|                    |        |                  |
|                 +--+--------v--+               |
|   +-----------> |    veth0     | 10.0.2.1/24   |
|   |             +-----------+--+               |
|   |                         |                  |
|   |          +--------------v-----+            |
|   |          |                    |            |
|   |          |      nftables      |            |
|   |          |   (ip masquerade)  |            |
|   |          +--------------+-----+            |
|   |                         |                  |
|   |   +------+  +-----------v--+               |
|   |   |  lo  |  |     eth0     |               |
|   |   +------+  +--^--------+--+               |
|   |                |        |                  |
|   |          +-----+--------v-----+            |
|   |          |                    |            |
|   +----------+      nftables      |            |
| if with port | (port redirecter)  |            |
|              +-----^--------+-----+            |
|                    |        |                  |
+--------------------+--------+------------------+
                     |        |
                     |        |
                     |        v
                    o u t s i d e

@jenting
Contributor Author

jenting commented May 10, 2022

I disabled the supervisor bit and proceeded to check the agent instrumentation part, which checks whether the files from the init tasks are created. I saw an Unauthorized error while the agent binary was being copied to the workspace pod. I am debugging this further.

I encountered this as well. Debugging...

@jenting jenting closed this as completed May 10, 2022
@jenting jenting reopened this May 10, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team May 10, 2022
@jenting
Contributor Author

jenting commented May 11, 2022

Thanks @csweichel. That sounds like the easier way to check the supervisor status. I will use this approach.

I disabled the supervisor bit and proceeded to check the agent instrumentation part, which checks whether the files from the init tasks are created. I saw an Unauthorized error while the agent binary was being copied to the workspace pod. I am debugging this further.

If we switch back to the core-dev env and run TestRegularWorkspaceTasks against it, there is no more Unauthorized error. Perhaps it's an environment issue rather than a code-level issue.

@princerachit
Contributor

princerachit commented May 11, 2022

Thanks for your remark @jenting. I tried looking at the API server logs for the k3s installation of the preview env but could not see any errors pertaining to the calls that we make in our test.

@gitpod-io/platform would you know why this would happen? I can work together with you to triage the problem.

About using the IDE URL: this does not work out of the box and needs more work. The call to the supervisor using the IDE URL fails with the login page. I will find a workaround for this; I might have missed something. Rechecking.

@princerachit
Contributor

princerachit commented May 11, 2022

The Unauthorized error is indeed affecting other tests as well, e.g.:

func TestBackup(t *testing.T) {
	f := features.New("backup").

=== RUN   TestBackup
=== RUN   TestBackup/backup
=== RUN   TestBackup/backup/it_should_start_a_workspace,_create_a_file_and_successfully_create_a_backup
    content_test.go:50: Could not run copy operation: Unauthorized
--- FAIL: TestBackup (75.36s)
    --- FAIL: TestBackup/backup (75.36s)
        --- FAIL: TestBackup/backup/it_should_start_a_workspace,_create_a_file_and_successfully_create_a_backup (75.35s)

@kylos101 kylos101 moved this from Done to In Progress in 🌌 Workspace Team May 12, 2022
@kylos101
Contributor

@princerachit this was marked as done; moving it back to in-progress. Not sure if you meant to mark it as done.

@jenting
Contributor Author

jenting commented May 12, 2022

@princerachit this was marked as done; moving it back to in-progress. Not sure if you meant to mark it as done.

I think it's because I closed this issue by accident 😅 (clicked the wrong button) and forgot to move it back to in-progress status. Sorry about that.

@jenting
Contributor Author

jenting commented May 12, 2022

@princerachit If I comment out this line, the integration test passes on the preview environment without the Unauthorized error.

So your assumption is correct: there is some difference between the k3s and GKE clusters.

@princerachit
Contributor

@princerachit If I comment out this line, the integration test passes on the preview environment without the Unauthorized error.

So your assumption is correct: there is some difference between the k3s and GKE clusters.

I suspect this is because we have different methods of authorization: with core-dev we use an access token to authorize, but with k3s we use client-certificate-data and client-key-data.

@princerachit
Contributor

So here is the issue: the rest.Config object being passed to this function can have either an access token for auth or a TLSClientConfig containing the client-certificate-data and client-key-data.

When the method is an access token, we can override the TLSClientConfig with connection info. However, when the method is a client certificate and key, overwriting the TLSClientConfig removes the existing credentials. Thus, we see the Unauthorized error:

config.TLSClientConfig = rest.TLSClientConfig{Insecure: true}
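
A minimal sketch, using k8s.io/client-go/rest types, of why that overwrite breaks client-certificate auth; the host and the certPEM/keyPEM variables are placeholders standing in for the kubeconfig fields:

package main

import "k8s.io/client-go/rest"

func brokenConfig(certPEM, keyPEM []byte) *rest.Config {
	cfg := &rest.Config{
		Host: "https://127.0.0.1:6443", // placeholder k3s API server
		TLSClientConfig: rest.TLSClientConfig{
			CertData: certPEM, // client-certificate-data from the kubeconfig
			KeyData:  keyPEM,  // client-key-data from the kubeconfig
		},
	}

	// Replacing the whole struct zeroes CertData and KeyData, so every
	// request reaches the kube-apiserver without client credentials and
	// is rejected as Unauthorized. With token auth this is harmless,
	// because the bearer token lives elsewhere on rest.Config.
	cfg.TLSClientConfig = rest.TLSClientConfig{Insecure: true}
	return cfg
}

func main() {
	_ = brokenConfig(nil, nil)
}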

@jenting
Contributor Author

jenting commented May 12, 2022

We could delete the line

config.TLSClientConfig = rest.TLSClientConfig{Insecure: true}

because we are able to use a secure TLS config to interact with the kube-apiserver on the core-dev env as well as the preview env.

Repository owner moved this from In Progress to Done in 🌌 Workspace Team May 12, 2022
@princerachit princerachit reopened this May 12, 2022
@kylos101 kylos101 moved this from Done to In Progress in 🌌 Workspace Team May 12, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team May 12, 2022