
[prebuilds] Many prebuilds fail with "OutOfmemory: Pod Node didn't have enough resource: memory" #8594

Closed
svenefftinge opened this issue Mar 4, 2022 · 12 comments
Labels
priority: highest (user impact) Directly user impacting team: workspace Issue belongs to the Workspace team type: bug Something isn't working

Comments

@svenefftinge (Member)

See also #8592

@svenefftinge svenefftinge added the team: workspace Issue belongs to the Workspace team label Mar 4, 2022
@svenefftinge (Member, Author) commented Mar 7, 2022

Here's the number of prebuilds that failed with "OutOfmemory ..." over the last few days:

| count | day |
|------:|------------|
| 313 | 2022-03-07 |
| 28 | 2022-03-06 |
| 54 | 2022-03-05 |
| 132 | 2022-03-04 |
| 153 | 2022-03-03 |
| 215 | 2022-03-02 |
| 260 | 2022-03-01 |
| 157 | 2022-02-28 |
| 103 | 2022-02-27 |
| 142 | 2022-02-26 |
| 179 | 2022-02-25 |
| 593 | 2022-02-24 |
| 451 | 2022-02-23 |
| 489 | 2022-02-22 |
| 272 | 2022-02-21 |
| 94 | 2022-02-20 |
| 145 | 2022-02-19 |
| 6 | 2022-02-18 |
| 1 | 2022-01-25 |

@sagor999 (Contributor) commented Mar 7, 2022

This is related to this: #8238

For regular workspaces we retry until we get a node that has enough memory.
I wonder whether, for prebuilds, we only try to schedule the pod once and don't attempt again if it fails? 🤔
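
For illustration, here is a minimal sketch of what such a retry loop could look like with client-go. This is an assumption about the mechanism, not our actual ws-manager code; the function name, poll interval, and retry count are invented:

```go
package scheduling

import (
	"context"
	"fmt"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodWithRetry creates the pod and, if the kubelet rejects it with
// status reason "OutOfmemory" (the spelling k8s actually uses), deletes it
// and tries again so the scheduler can pick a node with enough free memory.
func createPodWithRetry(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, maxRetries int) error {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		created, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod.DeepCopy(), metav1.CreateOptions{})
		if err != nil {
			return err
		}
		for {
			time.Sleep(2 * time.Second) // illustrative poll interval
			p, err := client.CoreV1().Pods(created.Namespace).Get(ctx, created.Name, metav1.GetOptions{})
			if err != nil {
				return err
			}
			if p.Status.Phase == corev1.PodRunning {
				return nil // scheduled and admitted successfully
			}
			if p.Status.Phase == corev1.PodFailed && strings.EqualFold(p.Status.Reason, "OutOfmemory") {
				// Admission on the node failed: clean up and retry.
				if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
					return err
				}
				break
			}
		}
	}
	return fmt.Errorf("pod %s still OutOfmemory after %d attempts", pod.Name, maxRetries)
}
```

If prebuild pods skip a loop like this, a single "OutOfmemory" rejection would immediately fail the prebuild, which would match the numbers above.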

On the positive side: the k8s fix for it is already in the works and will hopefully be released soon in a patch version of 1.23 🙏

@atduarte (Contributor) commented Apr 12, 2022

@sagor999 unfortunately, this is still happening in us39a. I guess the 1.23 patch didn't solve it :/ [1]

@svenefftinge (Member, Author)

This is still happening regularly, and since this week actually twice as often.

| count | day |
|------:|------------|
| 212 | 2022-04-12 |
| 524 | 2022-04-11 |
| 487 | 2022-04-10 |
| 52 | 2022-04-09 |
| 152 | 2022-04-08 |
| 211 | 2022-04-07 |
| 296 | 2022-04-06 |
| 285 | 2022-04-05 |
| 209 | 2022-04-04 |
| 44 | 2022-04-03 |
| 58 | 2022-04-02 |
| 217 | 2022-04-01 |
| 245 | 2022-03-31 |
| 231 | 2022-03-30 |
| 349 | 2022-03-29 |
| 383 | 2022-03-28 |
| 79 | 2022-03-27 |
| 116 | 2022-03-26 |
| 205 | 2022-03-25 |
| 196 | 2022-03-24 |
| 247 | 2022-03-23 |
| 166 | 2022-03-22 |
| 239 | 2022-03-21 |
| 60 | 2022-03-20 |
| 22 | 2022-03-19 |
| 135 | 2022-03-18 |
| 185 | 2022-03-17 |
| 190 | 2022-03-16 |
| 147 | 2022-03-15 |
| 185 | 2022-03-14 |
| 63 | 2022-03-13 |
| 68 | 2022-03-12 |
| 224 | 2022-03-11 |
| 187 | 2022-03-10 |
| 205 | 2022-03-09 |
| 207 | 2022-03-08 |
| 338 | 2022-03-07 |
| 28 | 2022-03-06 |
| 54 | 2022-03-05 |
| 132 | 2022-03-04 |
| 153 | 2022-03-03 |
| 215 | 2022-03-02 |
| 260 | 2022-03-01 |
| 157 | 2022-02-28 |
| 103 | 2022-02-27 |
| 142 | 2022-02-26 |
| 179 | 2022-02-25 |
| 593 | 2022-02-24 |
| 451 | 2022-02-23 |
| 489 | 2022-02-22 |
| 272 | 2022-02-21 |
| 94 | 2022-02-20 |
| 145 | 2022-02-19 |
| 6 | 2022-02-18 |
| 1 | 2022-01-25 |

@jmls commented Apr 12, 2022

We hit the same issue today.

@atduarte atduarte added type: bug Something isn't working priority: highest (user impact) Directly user impacting labels Apr 12, 2022
@sagor999 (Contributor)

The real fix for this has not been merged into k8s yet: kubernetes/kubernetes#108366. It seems to be scheduled for release in version 1.24, which is still in beta.

What we have is a workaround, though I am not entirely sure whether it works for prebuilds, as we have been testing it primarily with regular workspaces. 🤔
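
One way to check would be to count prebuild pods that the kubelet actually rejected. Here is a hedged sketch with client-go; the `headless=true` label selector used to identify prebuild pods is an assumption:

```go
// Sketch: list prebuild pods rejected with "OutOfmemory" so we can tell
// whether the workaround kicks in for them too.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Server-side filter to failed pods; refine to OutOfmemory client-side.
	pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{
		LabelSelector: "headless=true", // assumed marker for prebuild pods
		FieldSelector: "status.phase=Failed",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if strings.EqualFold(p.Status.Reason, "OutOfmemory") {
			fmt.Printf("%s/%s: %s\n", p.Namespace, p.Name, p.Status.Message)
		}
	}
}
```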

@atduarte (Contributor)

@kylos101 Scheduled the issue after seeing these numbers:

| day | count |
|------------|------:|
| 2022-04-11 | 524 |
| 2022-04-12 | 1124 |
| 2022-04-13 | 446 |
| 2022-04-14 | 236 |
| 2022-04-15 | 130 (not final) |

@atduarte atduarte moved this to Scheduled in 🌌 Workspace Team Apr 15, 2022
@aledbf (Member) commented Apr 15, 2022

@atduarte the final fix for the issue lands in Kubernetes v1.23.6, which is scheduled for Tuesday the 19th.

@atduarte (Contributor)

From what I saw on the Kubernetes releases page, they haven't released the fix yet. Do we have any other alternative?

@aledbf (Member) commented Apr 26, 2022

@atduarte the fix is already deployed in gen42.

@sagor999 (Contributor)

Since the actual fix was deployed and is in prod, I am going to close this issue now.

Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Apr 26, 2022
@aledbf (Member) commented Apr 27, 2022

Latest numbers related to the issue:

| day | count |
|------------|------:|
| 2022-04-21 | 205 |
| 2022-04-22 | 160 |
| 2022-04-23 | 2 |

(gen42 was deployed on the 22nd and the traffic shift finished the 23rd)
