
[prebuild] Not able to consistently view logs and start related workspaces #8324

Closed
Tracked by #9291 ...
kylos101 opened this issue Feb 19, 2022 · 14 comments
Labels
feature: prebuilds meta: stale This issue/PR is stale and will be closed soon team: webapp Issue belongs to the WebApp team

Comments

@kylos101
Contributor

Bug description

This happens intermittently for some users. When they open a prebuild to view its logs, they sometimes see the screen below; the prebuild (which normally takes ~10 minutes) times out after an hour, and the related workspace then cannot be started:
[screenshot]

Refreshing the page results in the same screen (it does not reload and then show the editor).

As a workaround, the prebuild can be triggered manually by either:

  1. Building a URL and browsing to it: gitpod.io/#prebuild/https://github.com/your-org/your-repo
  2. Pushing a new commit
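Workaround 1 can be sketched as simple URL construction (your-org/your-repo is a placeholder for the real repository):

```shell
# Sketch: build the Gitpod prebuild-trigger URL for a given repository.
# The repo URL below is a placeholder, not a real project.
repo_url="https://github.com/your-org/your-repo"
echo "https://gitpod.io/#prebuild/${repo_url}"
```

Opening the printed URL in a browser asks Gitpod to start a fresh prebuild for that repository.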

Sometimes the logs are visible, but the workspace cannot be started, and users see an ERR_SOCKET_TIMEOUT error.
[screenshot]

Steps to reproduce

This appears to happen when a prebuild generates heavy network traffic, typically while pulling images via docker compose or running npm install.

Workspace affected

n/a

Expected behavior

The prebuild logs must be visible when a prebuild is running.

When a prebuild stops within a workspace, supervisor must reliably terminate the pod, rather than wait for the 1 hour to finish.

Example repository

n/a

Anything else?

While observing this behavior today, we saw a customer workload that normally takes 10 minutes enter a state where supervisor became unhealthy after 10 minutes (so roughly when the prebuild would have finished if it were successful).

@kylos101 kylos101 added feature: prebuilds priority: highest (user impact) Directly user impacting team: webapp Issue belongs to the WebApp team labels Feb 19, 2022
@axonasif
Member

@jmls you can subscribe here for updates 🙏

@jldec jldec moved this to Scheduled in 🍎 WebApp Team Feb 24, 2022
@kylos101
Contributor Author

@jldec hey there, this could be a bug with supervisor. Looping in @akosyakov and the IDE team, adding to their inbox.

@akosyakov
Member

Has someone from @gitpod-io/engineering-webapp already investigated whether we lose the logs somewhere between supervisor and the dashboard?

@svenefftinge
Member

svenefftinge commented Feb 24, 2022

Really just a theory, but we are currently (over)packing nodes with prebuilds, so they often run much slower than usual/locally because they only get 0.7 CPUs (AFAIR).

So my theory is that when a network request is initiated but, due to CPU scheduling, no further activity happens on that thread for a certain time, timeouts can be hit. We have seen lots of reports of very different prebuild execution times.

We should probably not schedule more prebuilds than we have CPU cores, because prebuilds are often CPU bound and in that sense have a very different usage profile from interactive workspaces.
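The density argument above comes down to simple arithmetic. As a back-of-envelope sketch (the core count and per-prebuild CPU request below are assumptions for illustration, not Gitpod's actual values):

```shell
# Sketch: if prebuilds are CPU bound, cap concurrent prebuilds per node
# at roughly cores / CPU-per-prebuild.
cores=16            # cores per node -- assumed for illustration
cpu_per_prebuild=1  # one full CPU per prebuild, instead of ~0.7
max_prebuilds=$(( cores / cpu_per_prebuild ))
echo "$max_prebuilds"
```

With ~0.7 CPUs per prebuild, the same node would instead pack 22+ prebuilds, which is where the slowdowns and timeouts are suspected to come from.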

@svenefftinge
Member

We should probably not schedule more prebuilds than we have CPU cores, because prebuilds are often CPU bound and in that sense have a very different usage profile than interactive workspaces.

@kylos101 independent of whether this would improve the situation described in this ticket, I believe reducing the density to a point where still 80% of CPU is used most of the time makes sense. cc @csweichel

@akosyakov
Member

It seems that it was an npm network hiccup in a user task. Increasing the timeout seems to resolve it: https://gitpod.slack.com/archives/C01KGM9BH54/p1645714873540489?thread_ts=1645712044.304869&cid=C01KGM9BH54

@geropl
Member

geropl commented Feb 28, 2022

I'm closing this issue because the network timeout is not an inherent problem in itself, and can be mitigated by increasing the npm/yarn network timeout: yarn install --network-timeout 60000

Nonetheless, we should have follow-up issues for:

/cc @JanKoehnlein
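As a sketch, the mitigation above would live in the repository's .gitpod.yml (the 600000 ms value here is illustrative, not a recommendation, and assumes a yarn-based init task):

```yaml
# .gitpod.yml -- sketch, assuming the project installs dependencies with yarn
tasks:
  - init: yarn install --network-timeout 600000
```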

@geropl geropl closed this as completed Feb 28, 2022
Repository owner moved this from Scheduled to Done in 🍎 WebApp Team Feb 28, 2022
@jmls

jmls commented Feb 28, 2022

I strongly disagree with closing this issue. The npm issue can be mitigated with yarn and timeouts; however, we still have major network and disk issues with docker compose, which cause our prebuilds to have an 80%+ failure rate.

@csweichel csweichel reopened this Feb 28, 2022
Repository owner moved this from Done to In Progress in 🍎 WebApp Team Feb 28, 2022
@csweichel
Contributor

Discussion is still ongoing as to whether this is really a network issue. We continue to observe the behaviour of prebuilds, specifically looking into yours, @jmls.

@kylos101
Contributor Author

kylos101 commented Feb 28, 2022

I was able to reproduce the problem where logs cannot be viewed consistently for a running prebuild. It doesn't happen every time. To recreate it, I made roughly 5-10 commits (pushing each one), and for each prebuild I would view the prebuild logs, go back to my list of prebuilds, and try to view the same prebuild a second time.

  1. Public repo
  2. Prebuild I tried to view
  3. What I saw as a user:
    [screenshot]

It seems like, in the response, the instance ID is correct:

{"jsonrpc":"2.0","id":10,"result":{"streams":{"0":"https://gitpod.io/headless-logs/b3bf302e-2d81-4a53-bafa-145b1f9b40c3/f2956fba-fec1-47f5-84bd-c84bf296244b"}}}

My prebuild workspace is running:

$ date
Mon 28 Feb 2022 05:58:59 PM EST
$ kubectl get pod prebuild-b3bf302e-2d81-4a53-bafa-145b1f9b40c3
NAME                                            READY   STATUS    RESTARTS   AGE
prebuild-b3bf302e-2d81-4a53-bafa-145b1f9b40c3   1/1     Running   0          50m

Also, my prebuild task is still running on the pod:

$ ps -axf
    PID TTY      STAT   TIME COMMAND
   2948 pts/1    Ss     0:00 sh
   2954 pts/1    R+     0:00  \_ ps -axf
      1 ?        Ssl    0:00 /.supervisor/workspacekit ring0
     12 ?        Sl     0:14 /proc/self/exe ring1 --mapping-established
     27 ?        Sl     0:00  \_ supervisor init
     58 ?        Sl     0:13  |   \_ supervisor run
     81 pts/0    Ss     0:00  |       \_ /bin/bash
     94 pts/0    S+     0:00  |           \_ /bin/bash -e /workspace/prebuild-experiment/init.sh
     47 ?        S      0:02  \_ /.supervisor/slirp4netns --configure --mtu=65520 --disable-host-loopback --api-socket /tmp/wskit-slirp4netns740493749/slirp4netns.sock 27 tap0

I suspect something is not right with the related terminalId. Unfortunately, I need to go and cannot track down how the terminalId relates to my workspace. To be continued!
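For anyone retracing this, the JSON-RPC response above maps stream index "0" to a headless-log URL. A quick sketch for extracting it (the response string is copied from this comment; fetching the URL itself would additionally require an authenticated Gitpod session):

```shell
# Sketch: pull the headless-log stream URL out of the JSON-RPC response.
resp='{"jsonrpc":"2.0","id":10,"result":{"streams":{"0":"https://gitpod.io/headless-logs/b3bf302e-2d81-4a53-bafa-145b1f9b40c3/f2956fba-fec1-47f5-84bd-c84bf296244b"}}}'
url=$(printf '%s' "$resp" | sed -n 's/.*"0":"\([^"]*\)".*/\1/p')
echo "$url"
```

The first UUID in the path matches the prebuild workspace ID seen in the kubectl output above; the second is presumably the terminal/stream ID in question.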

@kylos101
Contributor Author

Hey @geropl 👋, is this info helpful for the intermittent issue, which prevents users from viewing active prebuild logs?

Is there anything else you'd like me to check? If yes, I can try recreating this again tomorrow if it helps - let me know.

I didn't get a chance to track down how we get terminalId, or how we're making the connection to the prebuild workspace pod.

@jldec
Contributor

jldec commented Mar 1, 2022

Thanks @kylos101 - we're tracking the (mis)behavior of the "Prebuild in Progress" page in #8446.
(I cross-linked your comment/investigation there as well.)

@jldec jldec removed the priority: highest (user impact) Directly user impacting label Mar 1, 2022
@jldec jldec removed this from 🚀 IDE Team Mar 1, 2022
@jldec jldec removed the status in 🍎 WebApp Team Mar 1, 2022
@stale

stale bot commented May 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label May 30, 2022
@geropl geropl mentioned this issue May 31, 2022
@laushinka
Contributor

I tested this with the same repo that @kylos101 used, made sure it ran for at least 10 minutes, and repeated this a few times to check consistency. I didn't run into the problem anymore, so I'll close this.
