
[prebuild] Not able to consistently view logs and start related workspaces #8324

Closed
Tracked by #9291 ...
kylos101 opened this issue Feb 19, 2022 · 14 comments
Labels
feature: prebuilds meta: stale This issue/PR is stale and will be closed soon team: webapp Issue belongs to the WebApp team

Comments

@kylos101
Contributor

Bug description

This happens intermittently for some users. When they open a prebuild to view its logs, they sometimes see the screen below; the prebuild (which normally takes ~10 minutes) times out after an hour, and the related workspace then cannot be started:
[screenshot]

Refreshing the page results in the same screen (it does not reload and then show the editor).

As a workaround, the prebuild can be triggered manually by either:

  1. Building a URL and browsing to it: gitpod.io/#prebuild/https://github.com/your-org/your-repo
  2. Pushing a new commit
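Workaround 1 can be sketched as simple URL construction (your-org/your-repo is a placeholder for the real repository):

```shell
# Sketch: build the Gitpod prebuild-trigger URL for a given repository.
# The repo URL below is a placeholder, not a real project.
repo_url="https://github.com/your-org/your-repo"
echo "https://gitpod.io/#prebuild/${repo_url}"
```

Opening the printed URL in a browser asks Gitpod to start a fresh prebuild for that repository.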

Sometimes the logs are visible, but the workspace cannot be started, and users see an ERR_SOCKET_TIMEOUT error.
[screenshot]

Steps to reproduce

This appears to happen when a prebuild generates heavy network traffic, typically while pulling images via docker compose or running npm install.

Workspace affected

n/a

Expected behavior

The prebuild logs must be visible when a prebuild is running.

When a prebuild stops within a workspace, supervisor must reliably terminate the pod, rather than wait for the 1 hour to finish.

Example repository

n/a

Anything else?

While observing this behavior today, we saw a customer workload that normally takes 10 minutes enter a state where supervisor became unhealthy after 10 minutes (so roughly when the prebuild would have finished if it were successful).

@kylos101 kylos101 added feature: prebuilds priority: highest (user impact) Directly user impacting team: webapp Issue belongs to the WebApp team labels Feb 19, 2022
@axonasif
Member

@jmls you can subscribe here for updates 🙏

@jldec jldec moved this to Scheduled in 🍎 WebApp Team Feb 24, 2022
@kylos101
Contributor Author

@jldec hey there, this could be a bug with supervisor. Looping in @akosyakov and the IDE team, adding to their inbox.

@akosyakov
Member

Has someone from @gitpod-io/engineering-webapp already investigated whether we lose the logs somewhere between supervisor and the dashboard?

@svenefftinge
Member

svenefftinge commented Feb 24, 2022

Really just a theory, but we are currently (over)packing nodes with prebuilds, so they often run much slower than usual/locally because they only get 0.7 CPUs (AFAIR).

So my theory is that when a network request is initiated but, due to CPU scheduling, no further activity happens on that thread for a certain time, timeouts can be hit. We have seen lots of reports of very different prebuild execution times.

We should probably not schedule more prebuilds than we have CPU cores, because prebuilds are often CPU bound and in that sense have a very different usage profile from interactive workspaces.
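The density argument above comes down to simple arithmetic. As a back-of-envelope sketch (the core count and per-prebuild CPU request below are assumptions for illustration, not Gitpod's actual values):

```shell
# Sketch: if prebuilds are CPU bound, cap concurrent prebuilds per node
# at roughly cores / CPU-per-prebuild.
cores=16            # cores per node -- assumed for illustration
cpu_per_prebuild=1  # one full CPU per prebuild, instead of ~0.7
max_prebuilds=$(( cores / cpu_per_prebuild ))
echo "$max_prebuilds"
```

With ~0.7 CPUs per prebuild, the same node would instead pack 22+ prebuilds, which is where the slowdowns and timeouts are suspected to come from.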

@svenefftinge
Member

We should probably not schedule more prebuilds than we have CPU cores, because prebuilds are often CPU bound and in that sense have a very different usage profile than interactive workspaces.

@kylos101 independent of whether this would improve the situation described in this ticket, I believe reducing the density to a point where still 80% of CPU is used most of the time makes sense. cc @csweichel

@akosyakov
Member

It seems that it was an npm network hiccup in a user task. Increasing the timeout seems to resolve it: https://gitpod.slack.com/archives/C01KGM9BH54/p1645714873540489?thread_ts=1645712044.304869&cid=C01KGM9BH54

@geropl
Member

geropl commented Feb 28, 2022

I'm closing this issue because the network timeout is not an inherent problem in itself, and can be mitigated by increasing the npm/yarn network timeout: yarn install --network-timeout 60000

Nonetheless, we should have follow-up issues for:

/cc @JanKoehnlein
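As a sketch, the mitigation above would live in the repository's .gitpod.yml (the 600000 ms value here is illustrative, not a recommendation, and assumes a yarn-based init task):

```yaml
# .gitpod.yml -- sketch, assuming the project installs dependencies with yarn
tasks:
  - init: yarn install --network-timeout 600000
```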

@geropl geropl closed this as completed Feb 28, 2022
Repository owner moved this from Scheduled to Done in 🍎 WebApp Team Feb 28, 2022
@jmls

jmls commented Feb 28, 2022

I strongly disagree with closing this issue. The npm issue can be mitigated with yarn and timeouts; however, we still have major network and disk issues with docker compose, which cause our prebuilds to have an 80%+ failure rate.

@csweichel csweichel reopened this Feb 28, 2022
Repository owner moved this from Done to In Progress in 🍎 WebApp Team Feb 28, 2022
@csweichel
Contributor

Discussion is still ongoing as to whether this is really a network issue. We continue to observe the behaviour of prebuilds, specifically looking into yours, @jmls.

@kylos101
Contributor Author

kylos101 commented Feb 28, 2022

I was able to reproduce the problem where logs cannot be viewed consistently for a running prebuild. It doesn't happen every time. To recreate it, I made roughly 5-10 commits (pushing each one), and for each prebuild I would view the prebuild logs, go back to my list of prebuilds, and try to view the same prebuild a second time.

  1. Public repo
  2. Prebuild I tried to view
  3. What I saw as a user:
    [screenshot]

It seems like, in the response, the instance ID is correct:

{"jsonrpc":"2.0","id":10,"result":{"streams":{"0":"https://gitpod.io/headless-logs/b3bf302e-2d81-4a53-bafa-145b1f9b40c3/f2956fba-fec1-47f5-84bd-c84bf296244b"}}}

My prebuild workspace is running:

$ date
Mon 28 Feb 2022 05:58:59 PM EST
$ kubectl get pod prebuild-b3bf302e-2d81-4a53-bafa-145b1f9b40c3
NAME                                            READY   STATUS    RESTARTS   AGE
prebuild-b3bf302e-2d81-4a53-bafa-145b1f9b40c3   1/1     Running   0          50m

Also, my prebuild task is still running on the pod:

$ ps -axf
    PID TTY      STAT   TIME COMMAND
   2948 pts/1    Ss     0:00 sh
   2954 pts/1    R+     0:00  \_ ps -axf
      1 ?        Ssl    0:00 /.supervisor/workspacekit ring0
     12 ?        Sl     0:14 /proc/self/exe ring1 --mapping-established
     27 ?        Sl     0:00  \_ supervisor init
     58 ?        Sl     0:13  |   \_ supervisor run
     81 pts/0    Ss     0:00  |       \_ /bin/bash
     94 pts/0    S+     0:00  |           \_ /bin/bash -e /workspace/prebuild-experiment/init.sh
     47 ?        S      0:02  \_ /.supervisor/slirp4netns --configure --mtu=65520 --disable-host-loopback --api-socket /tmp/wskit-slirp4netns740493749/slirp4netns.sock 27 tap0

I suspect something is not right with the related terminalId. Unfortunately, I need to go and cannot track down how the terminalId relates to my workspace. To be continued!
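For anyone retracing this, the JSON-RPC response above maps stream index "0" to a headless-log URL. A quick sketch for extracting it (the response string is copied from this comment; fetching the URL itself would additionally require an authenticated Gitpod session):

```shell
# Sketch: pull the headless-log stream URL out of the JSON-RPC response.
resp='{"jsonrpc":"2.0","id":10,"result":{"streams":{"0":"https://gitpod.io/headless-logs/b3bf302e-2d81-4a53-bafa-145b1f9b40c3/f2956fba-fec1-47f5-84bd-c84bf296244b"}}}'
url=$(printf '%s' "$resp" | sed -n 's/.*"0":"\([^"]*\)".*/\1/p')
echo "$url"
```

The first UUID in the path matches the prebuild workspace ID seen in the kubectl output above; the second is presumably the terminal/stream ID in question.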

@kylos101
Contributor Author

Hey @geropl 👋, is this info helpful for the intermittent issue, which prevents users from viewing active prebuild logs?

Is there anything else you'd like me to check? If yes, I can try recreating this again tomorrow if it helps - let me know.

I didn't get a chance to track down how we get terminalId, or how we're making the connection to the prebuild workspace pod.

@jldec
Contributor

jldec commented Mar 1, 2022

Thanks @kylos101 - we're tracking the (mis)behavior of the "Prebuild in Progress" page in #8446.
(I cross-linked your comment/investigation there as well.)

@jldec jldec removed the priority: highest (user impact) Directly user impacting label Mar 1, 2022
@jldec jldec removed this from 🚀 IDE Team Mar 1, 2022
@jldec jldec removed the status in 🍎 WebApp Team Mar 1, 2022
@stale

stale bot commented May 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label May 30, 2022
@geropl geropl mentioned this issue May 31, 2022
@laushinka
Contributor

I tested this with the same repo that @kylos101 used, made sure it ran for at least 10 minutes, and repeated this a few times to check consistency. I didn't run into the problem anymore, so I'll close this.
