Unexpected error loading prebuild #11852

Closed
aledbf opened this issue Aug 3, 2022 · 26 comments · Fixed by #12215
Labels
blocked component: ws-daemon meta: stale This issue/PR is stale and will be closed soon type: bug Something isn't working

Comments

@aledbf
Member

aledbf commented Aug 3, 2022

Bug description

rpc error: code = FailedPrecondition desc = cannot initialize workspace: prebuild initializer: Git fallback: git initializer gitClone: mkdir /dst/spring-petclinic: no such file or directory

[image: Jaeger-UI (4)]

Workspace affected

gitpodio-springpetclini-8a38a5a57eu

Expected behavior

  1. Log when this happens (we don't now).
  2. Ideally we'd wait longer until the file is ready, before trying to clone

Example repository

None

Anything else?

How long do we wait before the file system is ready?

@sagor999
Contributor

sagor999 commented Aug 4, 2022

We already have a log entry for this:

log.WithError(err).WithField("location", ws.Location).Error("cannot create directory")

Also, looking at the logs, it looks like the workspace failed, and then 5 minutes later ws-daemon tried to run the initializer for it (???):
[image]

I can't quite work out what happened in that workspace's lifecycle.

@kylos101
Contributor

kylos101 commented Aug 8, 2022

Thanks for looking at this one, @sagor999! @jenting, prior to resuming PVC work, could you peek at this to see what you can find? It's the last of the broken windows we found from the gen59 traces.

@jenting jenting self-assigned this Aug 8, 2022
@jenting jenting moved this from Scheduled to In Progress in 🌌 Workspace Team Aug 8, 2022
@jenting
Contributor

jenting commented Aug 8, 2022

  1. Log when this happens (we didn't for now).

We log it already

  1. Ideally we'd wait longer until the file is ready, before trying to clone

We haven't run git clone yet; the error reports that os.MkdirAll(ws.Location, 0775) failed 🤔


I thought the error happened at this line:

log.WithError(err).WithField("location", ws.Location).Error("cannot create directory")

However, I did not find the isGitWS tag in the trace:

span.SetTag("isGitWS", isGitWS)
🤔

@jenting jenting moved this from In Progress to Scheduled in 🌌 Workspace Team Aug 10, 2022
@jenting jenting assigned jenting and unassigned jenting Aug 10, 2022
@jenting jenting moved this from Scheduled to In Progress in 🌌 Workspace Team Aug 10, 2022
@jenting
Contributor

jenting commented Aug 11, 2022

I'm blocked on this issue now.

I can't reproduce locally a case where os.MkdirAll(ws.Location, 0775) fails with "no such file or directory". Is this because the mount point /dst/ is not ready yet?

@jenting jenting moved this from In Progress to Scheduled in 🌌 Workspace Team Aug 11, 2022
@jenting jenting removed their assignment Aug 11, 2022
@kylos101 kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team Aug 11, 2022
@kylos101
Contributor

👋 @jenting were you able to find anything meaningful via Google searches, or recreate similar misbehavior in https://go.dev/play/? I ask so that we can have that context when sharing this issue with a teammate next week.

For now, let's leave this blocked. Later this week I'll check how often this error occurs. The frequency will determine whether we reassign it to another teammate while you're out on vacation, etc.

@jenting
Contributor

jenting commented Aug 11, 2022

👋 @jenting were you able to find anything meaningful via Google searches, or recreate similar misbehavior in https://go.dev/play/? I ask so that we can have that context when sharing this issue with a teammate next week.

I did some Google searching and wrote similar code locally to reproduce the error, but no luck so far.

@jenting
Contributor

jenting commented Aug 12, 2022

This is odd. From the log, the error comes from here.
However, I can't find this warning log line in the GCP logs.

Note: we filter by instanceId="a7ad0fe1-3ebc-4786-a207-366f8c7c1e47"

@kylos101
Contributor

@jenting are you still blocked and in need of help from the team (if yes, please reach out in #t_workspace), or do you have more info to go on now because of this thread?

@utam0k
Contributor

utam0k commented Aug 19, 2022

If this PR doesn't fix this problem, we have to write code to check if the container is still alive.
#12215

Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Aug 22, 2022
@sagor999
Contributor

This is related to this: #12282
If StopWorkspace was called while workspace was still doing content init, then it may fail with this exact error, as ws-daemon does not know that workspace was stopped and /dst has disappeared.

@utam0k
Contributor

utam0k commented Aug 23, 2022

This is related to this: #12282 If StopWorkspace was called while workspace was still doing content init, then it may fail with this exact error, as ws-daemon does not know that workspace was stopped and /dst has disappeared.

I just wonder who deletes $wsRoot/dst? kubelet or our component? Do you know?

@sagor999
Contributor

Could it be that our housekeeping job in ws-daemon does that? 🤔

@jenting
Contributor

jenting commented Aug 23, 2022

I just wonder who deletes $wsRoot/dst? kubelet or our component? Do you know?

I have the same question.

Since we are not sure whether #12282 addressed this issue or not, we might need to consider reopening this issue.

@utam0k
Contributor

utam0k commented Aug 23, 2022

There are probably two patterns in this issue

I put the log links of those two patterns in this PR. Perhaps this PR will improve both, but it is unclear if they will be resolved.
#12215

So, I think if it happens again, we should reopen.

@jenting jenting removed the blocked label Aug 23, 2022
@jenting jenting moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Aug 25, 2022
@jenting
Contributor

jenting commented Aug 25, 2022

We need to check the Jaeger traces on the gen63 cluster to see whether it still happens.

@utam0k
Contributor

utam0k commented Aug 29, 2022

It still happens 😭
https://cloudlogging.app.goo.gl/4Hk68KGGpKBS1wyk9

@utam0k utam0k reopened this Aug 29, 2022
@utam0k utam0k moved this from In Validation to Breakdown in 🌌 Workspace Team Aug 29, 2022
@kylos101
Contributor

kylos101 commented Sep 6, 2022

FYI, we need webapp to add logging, so that we can know "why" stop workspace is being called via #12282. Once that is done, then we can proceed with this particular issue.

@kylos101
Contributor

kylos101 commented Sep 8, 2022

Added Blocked label, because we're waiting for webapp to schedule and do logging in #12282

@kylos101
Contributor

This is no longer blocked as of #12283

@kylos101 kylos101 removed the blocked label Sep 20, 2022
@atduarte
Contributor

During the refinement meeting, @utam0k mentioned he saw the error recently. We don't know exactly how to approach it other than looking at the new logs and trying to understand what's going on. There's currently no hypothesis.

@atduarte atduarte moved this from Breakdown to Scheduled in 🌌 Workspace Team Sep 27, 2022
@kylos101
Contributor

kylos101 commented Oct 4, 2022

@jenting @utam0k As this is Scheduled, and not In-progress, I removed you both from assigned. This way, it is "free" for later, when someone has bandwidth, then it can be assigned and status changed accordingly. 😄 Have a nice day you two! 👋

@utam0k
Contributor

utam0k commented Oct 4, 2022

@kylos101 💯 Thanks

@kylos101
Contributor

@sagor999 could you peek at the new logs, to see why the workspaces are stopping, to help form a plan of attack for this? I'm going to move this from Scheduled to the Inbox for now.

@sagor999
Contributor

Hm. I looked in the traces (US) and in the GCP logs for that error and cannot find it.
🤔

@stale

stale bot commented Jan 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Jan 16, 2023
@utam0k
Contributor

utam0k commented Jan 17, 2023

[image]

There are only a few error messages, so I closed this.

@utam0k utam0k closed this as completed Jan 17, 2023
@github-project-automation github-project-automation bot moved this to Awaiting Deployment in 🌌 Workspace Team Jan 17, 2023
6 participants