
[prebuilds] Many prebuilds fail with "OutOfmemory: Pod Node didn't have enough resource: memory" #8594

Closed
svenefftinge opened this issue Mar 4, 2022 · 12 comments
Labels
priority: highest (user impact) Directly user impacting team: workspace Issue belongs to the Workspace team type: bug Something isn't working

Comments

@svenefftinge (Member)

See also #8592

@svenefftinge svenefftinge added the team: workspace Issue belongs to the Workspace team label Mar 4, 2022
@svenefftinge (Member, Author) commented Mar 7, 2022

Here's the number of prebuilds that failed with "OutOfmemory ..." over the last few days:

| count | day |
|------:|------------|
| 313 | 2022-03-07 |
| 28 | 2022-03-06 |
| 54 | 2022-03-05 |
| 132 | 2022-03-04 |
| 153 | 2022-03-03 |
| 215 | 2022-03-02 |
| 260 | 2022-03-01 |
| 157 | 2022-02-28 |
| 103 | 2022-02-27 |
| 142 | 2022-02-26 |
| 179 | 2022-02-25 |
| 593 | 2022-02-24 |
| 451 | 2022-02-23 |
| 489 | 2022-02-22 |
| 272 | 2022-02-21 |
| 94 | 2022-02-20 |
| 145 | 2022-02-19 |
| 6 | 2022-02-18 |
| 1 | 2022-01-25 |

@sagor999 (Contributor) commented Mar 7, 2022

This is related to this: #8238

For regular workspaces we retry until we get a node that has enough memory.
I wonder whether, for prebuilds, we only try to schedule the pod once and don't attempt again if it fails? 🤔
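
For illustration, here is a minimal sketch of what such a retry loop could look like with client-go. This is an assumption about the mechanism, not our actual ws-manager code; the function name, poll interval, and retry count are invented:

```go
package scheduling

import (
	"context"
	"fmt"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodWithRetry creates the pod and, if the kubelet rejects it with
// status reason "OutOfmemory" (the spelling k8s actually uses), deletes it
// and tries again so the scheduler can pick a node with enough free memory.
func createPodWithRetry(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, maxRetries int) error {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		created, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod.DeepCopy(), metav1.CreateOptions{})
		if err != nil {
			return err
		}
		for {
			time.Sleep(2 * time.Second) // illustrative poll interval
			p, err := client.CoreV1().Pods(created.Namespace).Get(ctx, created.Name, metav1.GetOptions{})
			if err != nil {
				return err
			}
			if p.Status.Phase == corev1.PodRunning {
				return nil // scheduled and admitted successfully
			}
			if p.Status.Phase == corev1.PodFailed && strings.EqualFold(p.Status.Reason, "OutOfmemory") {
				// Admission on the node failed: clean up and retry.
				if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
					return err
				}
				break
			}
		}
	}
	return fmt.Errorf("pod %s still OutOfmemory after %d attempts", pod.Name, maxRetries)
}
```

If prebuild pods skip a loop like this, a single "OutOfmemory" rejection would immediately fail the prebuild, which would match the numbers above.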

On the positive side: the k8s fix for it is already in the works and will hopefully be released soon in a patch version of 1.23 🙏

@atduarte (Contributor) commented Apr 12, 2022

@sagor999 unfortunately, this is still happening in us39a. I guess the 1.23 patch didn't solve it :/ [1]

@svenefftinge (Member, Author)

This is still happening regularly, and since this week actually twice as often.

| count | day |
|------:|------------|
| 212 | 2022-04-12 |
| 524 | 2022-04-11 |
| 487 | 2022-04-10 |
| 52 | 2022-04-09 |
| 152 | 2022-04-08 |
| 211 | 2022-04-07 |
| 296 | 2022-04-06 |
| 285 | 2022-04-05 |
| 209 | 2022-04-04 |
| 44 | 2022-04-03 |
| 58 | 2022-04-02 |
| 217 | 2022-04-01 |
| 245 | 2022-03-31 |
| 231 | 2022-03-30 |
| 349 | 2022-03-29 |
| 383 | 2022-03-28 |
| 79 | 2022-03-27 |
| 116 | 2022-03-26 |
| 205 | 2022-03-25 |
| 196 | 2022-03-24 |
| 247 | 2022-03-23 |
| 166 | 2022-03-22 |
| 239 | 2022-03-21 |
| 60 | 2022-03-20 |
| 22 | 2022-03-19 |
| 135 | 2022-03-18 |
| 185 | 2022-03-17 |
| 190 | 2022-03-16 |
| 147 | 2022-03-15 |
| 185 | 2022-03-14 |
| 63 | 2022-03-13 |
| 68 | 2022-03-12 |
| 224 | 2022-03-11 |
| 187 | 2022-03-10 |
| 205 | 2022-03-09 |
| 207 | 2022-03-08 |
| 338 | 2022-03-07 |
| 28 | 2022-03-06 |
| 54 | 2022-03-05 |
| 132 | 2022-03-04 |
| 153 | 2022-03-03 |
| 215 | 2022-03-02 |
| 260 | 2022-03-01 |
| 157 | 2022-02-28 |
| 103 | 2022-02-27 |
| 142 | 2022-02-26 |
| 179 | 2022-02-25 |
| 593 | 2022-02-24 |
| 451 | 2022-02-23 |
| 489 | 2022-02-22 |
| 272 | 2022-02-21 |
| 94 | 2022-02-20 |
| 145 | 2022-02-19 |
| 6 | 2022-02-18 |
| 1 | 2022-01-25 |

@jmls commented Apr 12, 2022

We hit the same issue today.

@atduarte atduarte added type: bug Something isn't working priority: highest (user impact) Directly user impacting labels Apr 12, 2022
@sagor999 (Contributor)

The real fix for this has not been merged into k8s yet: kubernetes/kubernetes#108366. It seems to be scheduled for release in version 1.24, which is still in beta.

What we have is a workaround, though I am not entirely sure whether it works for prebuilds, as we have been testing it primarily with regular workspaces. 🤔
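
One way to check would be to count prebuild pods that the kubelet actually rejected. Here is a hedged sketch with client-go; the `headless=true` label selector used to identify prebuild pods is an assumption:

```go
// Sketch: list prebuild pods rejected with "OutOfmemory" so we can tell
// whether the workaround kicks in for them too.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Server-side filter to failed pods; refine to OutOfmemory client-side.
	pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{
		LabelSelector: "headless=true", // assumed marker for prebuild pods
		FieldSelector: "status.phase=Failed",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if strings.EqualFold(p.Status.Reason, "OutOfmemory") {
			fmt.Printf("%s/%s: %s\n", p.Namespace, p.Name, p.Status.Message)
		}
	}
}
```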

@atduarte (Contributor)

@kylos101 Scheduled the issue after seeing these numbers:

| day | count |
|------------|------:|
| 2022-04-11 | 524 |
| 2022-04-12 | 1124 |
| 2022-04-13 | 446 |
| 2022-04-14 | 236 |
| 2022-04-15 | 130 (not final) |

@atduarte atduarte moved this to Scheduled in 🌌 Workspace Team Apr 15, 2022
@aledbf (Member) commented Apr 15, 2022

@atduarte the final fix for the issue lands in Kubernetes v1.23.6, which is scheduled for Tuesday the 19th.

@atduarte (Contributor)

From what I saw on the Kubernetes releases page, they haven't released the fix yet. Do we have any other alternative?

@aledbf (Member) commented Apr 26, 2022

@atduarte the fix is already deployed in gen42.

@sagor999 (Contributor)

Since the actual fix was deployed and is in prod, I am going to close this issue now.

Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Apr 26, 2022
@aledbf (Member) commented Apr 27, 2022

Latest numbers related to the issue:

| day | count |
|------------|------:|
| 2022-04-21 | 205 |
| 2022-04-22 | 160 |
| 2022-04-23 | 2 |

(gen42 was deployed on the 22nd and the traffic shift finished the 23rd)
