[prebuild] Not able to consistently view logs and start related workspaces #8324
Comments
@jmls you can subscribe here for updates 🙏
@jldec hey there, this could be a bug with
Has someone from @gitpod-io/engineering-webapp already investigated whether we lose them somewhere between supervisor and dashboard?
Really just a theory, but we are currently (over)packing nodes with prebuilds, so they often run much slower than usual/locally because they only get 0.7 CPUs (AFAIR). My theory is that in cases where a network request is initiated but then, due to CPU scheduling, no further activity happens on that thread for a certain time, timeouts can be hit. We have seen lots of reports of very different prebuild execution times. We should probably not schedule more prebuilds than we have CPU cores, because prebuilds are often CPU bound and in that sense have a very different usage profile than interactive workspaces.
@kylos101 independent of whether this would improve the situation described in this ticket, I believe reducing the density to a point where roughly 80% of CPU is still used most of the time makes sense. cc @csweichel
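For illustration only, here is a minimal sketch of the kind of Kubernetes resource request that would keep the scheduler from overpacking a node with prebuilds. The field names are standard Kubernetes; the pod name, image, and values are hypothetical and not taken from Gitpod's actual workspace-manager configuration.

```yaml
# Hypothetical prebuild pod spec fragment.
# Requesting a full CPU per prebuild means a node can hold at most as many
# prebuilds as it has cores, instead of packing them at ~0.7 CPU each.
apiVersion: v1
kind: Pod
metadata:
  name: prebuild-example              # hypothetical name
spec:
  containers:
    - name: workspace
      image: example/prebuild:latest  # hypothetical image
      resources:
        requests:
          cpu: "1"                    # assumption: one core per CPU-bound prebuild
        limits:
          cpu: "2"                    # allow bursts while staying bounded
```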
It seems that it was an npm network hiccup in a user task. Increasing the timeout seems to resolve it: https://gitpod.slack.com/archives/C01KGM9BH54/p1645714873540489?thread_ts=1645712044.304869&cid=C01KGM9BH54
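As a rough sketch of that mitigation: `fetch-timeout` and the `fetch-retry-*` keys are standard npm config options, but the values below are only illustrative and are not the ones from the linked thread.

```sh
# Raise npm's network timeouts and retries before running npm install in the
# prebuild task, so slow transfers are retried instead of failing with
# ERR_SOCKET_TIMEOUT.
npm config set fetch-timeout 600000          # 10 minutes per request
npm config set fetch-retries 5
npm config set fetch-retry-mintimeout 20000
npm config set fetch-retry-maxtimeout 120000
npm install
```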
I'm closing this issue because the network timeout is not an inherent problem by itself and can be mitigated by increasing npm network timeouts. Nonetheless, we should have follow-up issues for:
/cc @JanKoehnlein
I strongly disagree with closing this issue. The npm issue can be mitigated with yarn and timeouts; however, we still have major network and disk issues with docker compose, which cause our prebuilds to have an 80%+ failure rate.
Discussion is still ongoing as to whether this is really a network issue. We continue to observe the behaviour of prebuilds, specifically looking into yours, @jmls.
I was able to reproduce the problem where logs cannot be viewed consistently for a running prebuild. It doesn't happen every time. To recreate I did roughly 5-10 commits (pushing on each commit), and for each prebuild, I would view the prebuild logs, go back to my list of prebuilds, and try to view the same prebuild a second time.
In the response, the instance ID seems correct:

```
{"jsonrpc":"2.0","id":10,"result":{"streams":{"0":"https://gitpod.io/headless-logs/b3bf302e-2d81-4a53-bafa-145b1f9b40c3/f2956fba-fec1-47f5-84bd-c84bf296244b"}}}
```

My prebuild workspace is running:

```
$ date
Mon 28 Feb 2022 05:58:59 PM EST
$ kubectl get pod prebuild-b3bf302e-2d81-4a53-bafa-145b1f9b40c3
NAME                                            READY   STATUS    RESTARTS   AGE
prebuild-b3bf302e-2d81-4a53-bafa-145b1f9b40c3   1/1     Running   0          50m
```

Also, my prebuild task is still running on the pod:

```
$ ps -axf
  PID TTY      STAT   TIME COMMAND
 2948 pts/1    Ss     0:00 sh
 2954 pts/1    R+     0:00  \_ ps -axf
    1 ?        Ssl    0:00 /.supervisor/workspacekit ring0
   12 ?        Sl     0:14 /proc/self/exe ring1 --mapping-established
   27 ?        Sl     0:00  \_ supervisor init
   58 ?        Sl     0:13  |   \_ supervisor run
   81 pts/0    Ss     0:00  |       \_ /bin/bash
   94 pts/0    S+     0:00  |           \_ /bin/bash -e /workspace/prebuild-experiment/init.sh
   47 ?        S      0:02  \_ /.supervisor/slirp4netns --configure --mtu=65520 --disable-host-loopback --api-socket /tmp/wskit-slirp4netns740493749/slirp4netns.sock 27 tap0
```

I suspect something is not right with the related terminalId. Unfortunately, I need to go and cannot track down how the terminalId relates to my workspace. To be continued!
Hey @geropl 👋, is this info helpful for the intermittent issue, which prevents users from viewing active prebuild logs? Is there anything else you'd like me to check? If yes, I can try recreating this again tomorrow if it helps - let me know. I didn't get a chance to track down how we get terminalId, or how we're making the connection to the prebuild workspace pod.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I tested this with the same repo that @kylos101 used, made sure it ran for at least 10 minutes, and repeated this a few times to check for consistency. I didn't run into the problem anymore and will close.
Bug description
This happens intermittently for some users. When opening a prebuild to view its logs, they sometimes see the following screen, the prebuild (which normally takes about 10 minutes) times out after an hour, and then the workspace cannot be started:
Refreshing the page results in the same screen (it does not reload and then show the editor).
As a workaround, the prebuild can be triggered manually by either:
Sometimes the logs are visible but the workspace cannot be started, and users see an ERR_SOCKET_TIMEOUT error.
Steps to reproduce
This appears to occur when a prebuild is doing heavy network traffic, generally pulling images via docker compose or running npm install; a minimal hypothetical configuration of this kind is sketched below.
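Purely as an illustration, a `.gitpod.yml` along these lines would exercise that kind of network-heavy prebuild. The image and commands are placeholders, not the affected customer's configuration.

```yaml
# Hypothetical prebuild configuration that generates heavy network traffic:
# docker compose pulls several images and npm downloads a large dependency tree.
image: gitpod/workspace-full
tasks:
  - init: |
      docker compose pull   # pull images during the prebuild
      npm install           # large dependency download
```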
Workspace affected
n/a
Expected behavior
The prebuild logs must be visible when a prebuild is running.
When a prebuild stops within a workspace, supervisor must reliably terminate the pod rather than waiting for the 1-hour timeout to expire.
Example repository
n/a
Anything else?
While observing this behavior today, we saw a customer workload that normally takes 10 minutes enter a state where supervisor became unhealthy after 10 minutes (so roughly when the prebuild would have finished had it been successful).