
Image is crash looping in radix #1128

Closed · fernandolucchesi opened this issue Jun 7, 2022 · 8 comments

Labels: 🐛 bug (Something isn't working)

Comments

No description provided.

fernandolucchesi commented Jun 20, 2022

The root cause of this issue is still unknown.
To buy us more investigation time, the number of replicas in production was increased to 3.

Findings:

  • There is a CPU and memory spike starting 3-5 minutes before the crash
  • Server-side logs show a pnpm error
  • Siteimprove scan is OK
  • Crashes are more frequent during the daytime -> possibly related to user behaviour

Hypothesis:
The crash is related to user behaviour on the client side, because:

  • there are no descriptive server-side logs
  • it happens when there is more traffic on the website

AND the crash is related to an interaction with some feature/integration (FriendlyCaptcha, forms, subscription, etc.), because:

  • it doesn't occur in the satellites
  • if the error were in our code (infinite loop, infinite hook re-render, etc.), the page would crash and the user would almost certainly refresh it, causing it to break over and over again.
  • following from the previous point, the client side is not broken right after the interaction; it is possibly an async request that is still alive even after the user changes route/page

TODO
Test/review:

  • FriendlyCaptcha
  • All Forms
  • Subscription

fernandolucchesi commented Jul 12, 2022

New findings:

SvSven commented Sep 6, 2022

First recorded crash alert on Slack was 2021.12.09.
It happened again on 2021.12.10, then nothing until 2021.12.20, when it crashed at 20:01 and did not report a resolve until 2021.12.21 at 09:22. It then proceeded to crash 4 times on 2021.12.21.

Following that, there were no reported crashes until 2022.02.10, when it crashed once.

Then nothing until 2022.03.17 - which marks the point where it starts crashing much more frequently.

According to the Slack app configuration, the alert manager integration was created on 2021.03.24 - whether it was already fully up and running at that point I'm not sure, but I think it's safe to assume the crash on the 9th was the first one.

SvSven commented Sep 6, 2022

On 2022.09.05 we had 2 replicas of the Brazil website crash at the same time, reported at 04:1 Norwegian time.

Following that, the Studio for Poland and Storage crashed in the preprod environment, and the web part for the secret site crashed in the prod environment at the same time, reported at 05:58. The editorial team confirmed that no one was using the secret site around that time.

This may indicate that the problem is with Radix and/or Docker: the secret site has no content and no traffic, and is only used for preview purposes. Similarly, the Studios are hosted separately from the web, and the preprod environment is not used by editors - indicating the problem may not be related to user interaction(s).

That being said - we could be dealing with several separate issues; we currently lack sufficient logging/data to know for sure. The problem could also be pnpm, since that is used in the Dockerfiles for both web and Studio.

nilsml commented Sep 27, 2022

@fernandolucchesi Did you hear back from the Dynatrace expert?

fernandolucchesi commented

> @fernandolucchesi Did you hear back from the Dynatrace expert?

Heard today, CCed you on the email.

fernandolucchesi commented

@SvSven @nilsml, can you review the PR? I am hoping that removing pnpm from the container will save the day. The container image was also optimised to be roughly 8x smaller.

fernandolucchesi commented

Mystery has finally been solved! 🥳
Refactoring the Dockerfile to stop using pnpm in both the build and the run steps did the job! Now we can finally sleep at night 😄
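
For context, here is a minimal sketch of the kind of multi-stage Dockerfile change described above. It is not the actual file from this repo: the base image, the Next.js entrypoint, and the switch to npm/package-lock.json are all assumptions - the thread only says that pnpm was removed from both the build and the run steps.

```dockerfile
# --- build stage: install and build with npm instead of pnpm (assumed replacement) ---
FROM node:16-alpine AS builder
WORKDIR /app

# npm ci needs a package-lock.json (regenerated once pnpm-lock.yaml is dropped)
COPY package.json package-lock.json ./
RUN npm ci

COPY . .
RUN npm run build

# --- runtime stage: no package manager needed to start the app ---
FROM node:16-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Copy only what the server needs at runtime
COPY --from=builder /app/package.json ./package.json
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/.next ./.next
COPY --from=builder /app/public ./public

EXPOSE 3000

# Launch the (assumed) Next.js server through node directly, so no pnpm process wraps it
CMD ["node", "node_modules/next/dist/bin/next", "start"]
```

The essential point from the resolution is only that pnpm never appears in the image, neither to install dependencies nor to launch the server; how the image was additionally slimmed down (the roughly 8x reduction mentioned earlier) is not detailed in the thread.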
