
Image is crash looping in radix #1128

Closed · fernandolucchesi opened this issue Jun 7, 2022 · 8 comments

Labels: 🐛 bug (Something isn't working)

Comments

No description provided.

fernandolucchesi commented Jun 20, 2022

The root cause of this issue is still unknown.
To buy us more investigation time, the number of replicas in production was increased to 3.

Findings:

  • There is a CPU and memory spike starting 3-5 minutes before the crash
  • Server-side logs show a pnpm error
  • Siteimprove scan is OK
  • Crashes are more frequent during the daytime -> possibly related to user behaviour

Hypothesis:
The crash is related to user behaviour on the client side, because:

  • there are no descriptive server-side logs
  • it happens when there is more traffic on the website

AND the crash is related to an interaction with some feature/integration (FriendlyCaptcha, forms, subscription, etc.), because:

  • it doesn't occur in the satellites
  • if the error were in our code (infinite loop, infinite hook re-render, etc.), the page would crash and the user would almost certainly refresh it, causing it to break over and over again.
  • following from the previous point, the client side is not broken right after the interaction; it is possibly an async request that is still alive even after the user changes route/page

TODO
Test/review:

  • FriendlyCaptcha
  • All Forms
  • Subscription

fernandolucchesi commented Jul 12, 2022

New findings:

SvSven commented Sep 6, 2022

First recorded crash alert on Slack was 2021.12.09.
It happened again on 2021.12.10, then nothing until 2021.12.20, when it crashed at 20:01 and did not report a resolve until 2021.12.21 at 09:22. It then proceeded to crash 4 times on 2021.12.21.

Following that, there were no reported crashes until 2022.02.10, when it crashed once.

Then nothing until 2022.03.17 - which marks the point where it starts crashing much more frequently.

According to the Slack app configuration, the alert manager integration was created on 2021.03.24 - whether it was already fully up and running at that point I'm not sure, but I think it's safe to assume the crash on the 9th was the first one.

SvSven commented Sep 6, 2022

On 2022.09.05 we had 2 replicas of the Brazil website crash at the same time, reported at 04:1 Norwegian time.

Following that, the Studio for Poland and Storage crashed in the preprod environment, and the web part for the secret site crashed in the prod environment at the same time, reported at 05:58. The editorial team confirmed that no one was using the secret site around that time.

This may indicate that the problem is with Radix and/or Docker: the secret site has no content and no traffic, and is only used for preview purposes. Similarly, the Studios are hosted separately from the web, and the preprod environment is not used by editors - indicating the problem may not be related to user interaction(s).

That being said - we could be dealing with several separate issues; we currently lack sufficient logging/data to know for sure. The problem could also be pnpm, since that is used in the Dockerfiles for both web and Studio.

nilsml commented Sep 27, 2022

@fernandolucchesi Did you hear back from the Dynatrace expert?

fernandolucchesi commented

> @fernandolucchesi Did you hear back from the Dynatrace expert?

Heard today, CCed you on the email.

fernandolucchesi commented

@SvSven @nilsml, can you review the PR? I am hoping that removing pnpm from the container will save the day. The container image was also optimised to be roughly 8x smaller.

fernandolucchesi commented

Mystery has finally been solved! 🥳
Refactoring the Dockerfile to stop using pnpm in both the build and the run steps did the job! Now we can finally sleep at night 😄
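
For context, here is a minimal sketch of the kind of multi-stage Dockerfile change described above. It is not the actual file from this repo: the base image, the Next.js entrypoint, and the switch to npm/package-lock.json are all assumptions - the thread only says that pnpm was removed from both the build and the run steps.

```dockerfile
# --- build stage: install and build with npm instead of pnpm (assumed replacement) ---
FROM node:16-alpine AS builder
WORKDIR /app

# npm ci needs a package-lock.json (regenerated once pnpm-lock.yaml is dropped)
COPY package.json package-lock.json ./
RUN npm ci

COPY . .
RUN npm run build

# --- runtime stage: no package manager needed to start the app ---
FROM node:16-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Copy only what the server needs at runtime
COPY --from=builder /app/package.json ./package.json
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/.next ./.next
COPY --from=builder /app/public ./public

EXPOSE 3000

# Launch the (assumed) Next.js server through node directly, so no pnpm process wraps it
CMD ["node", "node_modules/next/dist/bin/next", "start"]
```

The essential point from the resolution is only that pnpm never appears in the image, neither to install dependencies nor to launch the server; how the image was additionally slimmed down (the roughly 8x reduction mentioned earlier) is not detailed in the thread.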
