Reduce the number of users that GESIS server can have #3080

Open
rgaiacs opened this issue Aug 29, 2024 · 2 comments

@rgaiacs
Collaborator

rgaiacs commented Aug 29, 2024

Related to #3056

@arnim and I will reduce the maximum number of users that the GESIS server can have.

At the moment, the GESIS launch quota is 250, as configured in https://github.com/gesiscss/orc2/blob/e2fdacc0f3e0f8ab9aaff943ac78af2a9702e153/helm/gesis-config/production.yaml#L11.

@arnim and I need to find a new value for the "launch quota equilibrium", i.e. the launch quota that allows us to keep operating while downloading container images without the number of pending / terminating pods getting out of control, as in the following screenshot.

Screenshot 2024-08-29 at 20-03-04 Pod Activity - Dashboards - Grafana

@arnim and I will also reduce the GESIS weight contribution to the federation, defined in

during the search for the new "launch quota equilibrium".

"Launch Quota Equilibrium" Search Strategy

The following table will be updated as the search progresses.

| Date and time | Launch quota | Federation contribution | Notes |
| --- | --- | --- | --- |
| 2024-08-29 21:00 UTC+2 | 40 | 60 | To clean the pending / terminating pods |
| 2024-08-30 09:00 UTC+2 | 100 | 60 | |
| 2024-08-30 21:00 UTC+2 | 120 | 60 | |
| 2024-08-31 09:00 UTC+2 | 120 | 60 | |
| 2024-08-31 21:00 UTC+2 | 120 | 60 | |
| 2024-09-01 09:00 UTC+2 | 120 | 60 | |
| 2024-09-02 09:00 UTC+2 | 120 | 60 | |
| 2024-09-02 21:00 UTC+2 | ?? | 60 | |
| 2024-09-03 21:00 UTC+2 | ?? | 60 | |
| 2024-09-03 21:00 UTC+2 | ?? | 60 | |

From 2024-08-29 21:00 UTC+2 until 2024-08-30 09:00 UTC+2

The high number of pending pods was cleared.

Screenshot 2024-08-30 at 08-44-37 Pod Activity - Starred - Grafana

From 2024-08-30 09:00 UTC+2 until 2024-08-30 21:00 UTC+2

Because I could not be online at 2024-08-30 21:00 UTC+2, I made the change a bit earlier.

Screenshot 2024-08-30 at 19-02-10 Pod Activity - Starred - Grafana

The increase in pending pods at 2024-08-30 14:56:00 was caused by a stress test conducted by @arnim. The server still has problems when it needs to download many container images from Docker Hub.

From 2024-08-30 21:00 UTC+2 until 2024-08-31 09:00 UTC+2

Screenshot 2024-08-31 at 08-06-52 Pod Activity - Dashboards - Grafana

I don't know what happened at 2024-08-30 22:42:00 that caused the number of running pods to drop.

I also don't know what happened at 2024-08-30 23:46:00 that caused the number of pending pods to increase.

I'm running the same configuration for another 12 hours.

From 2024-08-31 09:00 UTC+2 until 2024-08-31 21:00 UTC+2

Screenshot 2024-09-02 at 10-12-40 Pod Activity - Starred - Grafana

From 2024-08-31 21:00 UTC+2 until 2024-09-01 09:00 UTC+2

Screenshot 2024-09-02 at 10-14-40 Pod Activity - Starred - Grafana

Around 2024-08-31 23:22:00, the number of pending pods started to increase. This correlates with an increase in the number of container image pulls. There was no increase in the number of image builds.

Does this have some correlation with containerd cleaning the local cache?

From 2024-09-01 09:00 UTC+2 until 2024-09-01 21:00 UTC+2

Screenshot 2024-09-02 at 10-21-24 Pod Activity - Starred - Grafana

Around 2024-09-01 19:18:00, the number of pending pods started to increase. This correlates with an increase in the number of container image pulls. There was no increase in the number of image builds.

Does this have some correlation with containerd cleaning the local cache?

From 2024-09-01 21:00 UTC+2 until 2024-09-02 09:00 UTC+2

Screenshot 2024-09-02 at 10-34-10 Pod Activity - Starred - Grafana

Between 2024-09-01 22:00:00 and 2024-09-02 01:00:00, the number of running pods was very low. Something probably went wrong with the server that prevented it from downloading container images.

Screenshot 2024-09-02 at 10-40-10 Overview - notebooks gesis org - Dashboards - Grafana

Screenshot 2024-09-02 at 10-36-27 1  Overview - Starred - Grafana

@rgaiacs rgaiacs self-assigned this Aug 29, 2024
rgaiacs added a commit to gesiscss/mybinder.org-deploy that referenced this issue Aug 29, 2024
rgaiacs added a commit to gesiscss/orc2 that referenced this issue Aug 30, 2024
rgaiacs added a commit to gesiscss/orc2 that referenced this issue Aug 30, 2024
@rgaiacs
Collaborator Author

rgaiacs commented Sep 2, 2024

At 2024-09-02 15:33:00, @arnim and I started a stress test on the GESIS server where we requested the build of 20 new container images.

Screenshot 2024-09-02 at 16-52-05 Overview - notebooks gesis org - Dashboards - Grafana

The request to build the containers is visible in the second chart as a big spike. Around the same time, we see an increase in the number of pending pods. The number of pending pods has two plateaus. Around 2024-09-02 15:43:00, Kubernetes could not pull any of the 20 new container images. Kubernetes terminated the pods and JupyterHub tried a new launch. This resulted in 20 Terminating pods plus 20 Pending pods. Later, Kubernetes terminated the new pods as well because it could not pull the images, leaving the cluster with 40 Terminating pods.

Because the pods hold references to the requests to pull the images, the Kubernetes garbage collector cannot delete the pods until those references are also deleted.
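The Pending / Terminating split described above can be cross-checked outside Grafana. The following is a minimal sketch, assuming the output of `kubectl get pods -o json` has been loaded into a Python dict. The function names and the sample data are hypothetical. It relies on the fact that kubectl reports a pod as Terminating when `metadata.deletionTimestamp` is set, regardless of `status.phase`.

```python
from collections import Counter


def pod_state(pod):
    """Classify one pod dict taken from the `items` of `kubectl get pods -o json`."""
    # A pod that is being deleted carries metadata.deletionTimestamp;
    # kubectl shows it as "Terminating" whatever status.phase says.
    if pod.get("metadata", {}).get("deletionTimestamp"):
        return "Terminating"
    return pod.get("status", {}).get("phase", "Unknown")


def count_states(pod_list):
    """Count pods per state in a PodList-shaped dict."""
    return Counter(pod_state(pod) for pod in pod_list.get("items", []))


# Hypothetical data modelled on the first plateau of the stress test:
# 20 pods being deleted and 20 relaunched pods still waiting for their image.
sample = {
    "items": [
        {
            "metadata": {"name": f"old-{i}", "deletionTimestamp": "2024-09-02T13:43:00Z"},
            "status": {"phase": "Pending"},
        }
        for i in range(20)
    ]
    + [
        {"metadata": {"name": f"new-{i}"}, "status": {"phase": "Pending"}}
        for i in range(20)
    ]
}

for state, number in sorted(count_states(sample).items()):
    print(state, number)
```

With real data, the dict would come from `json.load` applied to the saved kubectl output instead of the hand-written `sample`.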

Around 2024-09-02 16:06:15, Kubernetes started to download the container images. From the Kubernetes event logs:

| Container image | Download time | Waiting time |
| --- | --- | --- |
| gesiscss/binder-r2d-g5b5b759-dgothrek-2dipyaggrid-aa11bb:94f3d74bce0a0a0434248824f1071998c9c105fa | 11.419s | 0 |
| gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1 | 3.941s | 15min |
| gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1 | 933ms | 15min |
| gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9 | 910ms | 28min |

@arnim, the waiting time is the problem. It may be caused by some API limit, but we have the imagePullSecret configured. I will try to check the metadata of the pods tomorrow.
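One way to probe the API limit hypothesis, assuming the limit in question is the Docker Hub pull rate limit (an assumption, since the waiting could have another cause), is to look at the `ratelimit-limit` and `ratelimit-remaining` response headers that Docker Hub returns for manifest requests. Their values have the form `100;w=21600`, i.e. a count followed by a window in seconds. The sketch below only parses such header values. The function name is hypothetical and the sample values are illustrative, not measurements from the GESIS server.

```python
def parse_ratelimit(value):
    """Parse a header value such as '100;w=21600' into (count, window_seconds).

    window_seconds is None when the header carries no 'w=' parameter.
    """
    count_part, _, params = value.partition(";")
    window = None
    for param in params.split(";"):
        key, _, val = param.strip().partition("=")
        if key == "w" and val:
            window = int(val)
    return int(count_part.strip()), window


# Illustrative header values only.
limit, window = parse_ratelimit("100;w=21600")
remaining, _ = parse_ratelimit("76;w=21600")
print(f"{remaining} of {limit} pulls left in a {window // 3600} h window")
```

If the remaining count read with the credentials from the imagePullSecret stays well above zero during a stress test, a registry rate limit becomes a less likely explanation for the waiting time.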

Kubernetes Events Logs

27m                 Normal    Pulled             Pod/jupyter-dgothrek-2dipyaggrid-2d5hn841ym                        Successfully pulled image "gesiscss/binder-r2d-g5b5b759-dgothrek-2dipyaggrid-aa11bb:94f3d74bce0a0a0434248824f1071998c9c105fa" in 11.419s (11.42s including waiting)
32m                 Normal    Pulled             Pod/jupyter-eichstaedtptb-2dmontecarlohandson-2d4msqp66x           Successfully pulled image "gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1" in 3.941s (16m52.737s including waiting)
32m                 Normal    Pulled             Pod/jupyter-eichstaedtptb-2dmontecarlohandson-2dts191lx3           Successfully pulled image "gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1" in 933ms (16m34.432s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d1yh4r76x              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 910ms (28m50.629s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d4jomepwv              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:854b5c4dc0a9d8702b651730dcb73dbf1824ebab" in 3.673s (31m21.81s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d6uvfy8gm              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 3.175s (31m24.983s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d86sercph              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:dc64e6925e035e4b37091d4c30ca8b0571e5ac17" in 4.791s (27m36.887s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d8s0ug3qu              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:a8e94343385da7cea277684fa73a9a8e05e9b116" in 3.499s (31m37.303s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d8wb2y3na              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 911ms (28m28.346s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2d9ycn1fr2              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 924ms (28m47.51s including waiting)
38m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2daucf374h              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 915ms (36m32.659s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dc02g05lb              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:e287c8967d444edca30f6af4fbb60adfd02f72b4" in 3.091s (31m39.378s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2ddm1w12hm              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 1.002s (28m25.668s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2de9qli7tn              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:22648634e310715e1d34509e51a158a1ca9c94c5" in 990ms (28m50.25s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2deu3ebjvl              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:c3c3073488f9f0d3627daf989a4e0426ea8bc7f0" in 2.833s (31m45.505s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dfkdoj0av              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:4e1e19ea790364eb539e7a9e89f6fa8aaf30d281" in 876ms (28m51.123s including waiting)
38m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dlmbwcn06              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 979ms (36m31.747s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmmysoi22              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:22648634e310715e1d34509e51a158a1ca9c94c5" in 929ms (28m24.669s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmpp2elqx              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:bf354d4890a11183f6fb7d2bf8ab62e4f3773234" in 2.651s (30m27.059s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmrrs413l              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 892ms (28m26.557s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dmrynnzv2              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 890ms (28m49.262s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dnmdrkpmc              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:4e1e19ea790364eb539e7a9e89f6fa8aaf30d281" in 2.643s (29m9.898s including waiting)
38m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dnxld8wi6              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:82f0c4db07f88ec6dc760e44acd004bff5735cf4" in 930ms (36m33.587s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dpayjz2ut              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:16f92ec3605f141e357474f6f48643e9466e88db" in 2.814s (31m31.799s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dpoh4qhom              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:f90079cb5d89633cbba57b5d65cd2c5df17df726" in 2.675s (31m42.045s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dpy15hb30              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:c3c3073488f9f0d3627daf989a4e0426ea8bc7f0" in 866ms (28m48.375s including waiting)
50m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dqq7imv0z              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:fb5289a410eb1e13f64aff806d0e0daa4b7bbb91" in 2.767s (29m7.258s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dsfecfjr3              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:f8ae18112210b0f84f826d78d7975cbe648e7416" in 2.734s (30m21.966s including waiting)
48m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dvuja3dqm              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:453f79ca0cdd64b1820d7b38e344b2d5b4a3fab8" in 2.451s (30m24.415s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dwnhz1pe6              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:22648634e310715e1d34509e51a158a1ca9c94c5" in 2.666s (31m42.678s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dx0o9ibpl              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:644b2e554af0e9ecc76752b61f0647eea09179e9" in 3.031s (31m33.809s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dx6pvsjpx              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:11d3e7147788113abe67dc0b05ac9ea3cced67be" in 2.873s (31m47.355s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dyt0ltbct              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:5cdaf52101076fa1d85b2852aba9e0b018c1fde1" in 2.657s (31m26.613s including waiting)
52m                 Normal    Pulled             Pod/jupyter-gesiscss-2dorc2-2dtest-2dbuild-2dzn3qtbmn              Successfully pulled image "gesiscss/binder-r2d-g5b5b759-gesiscss-2dorc2-2dtest-2dbuild-ced273:6780067c979309d8aeb6dde8d6b7a138c831f38f" in 3.414s (31m29.011s including waiting)
24m                 Normal    Pulled             Pod/jupyter-giswqs-2dwhitebox-2drust-2dbinder-2d30nt0ab6           Successfully pulled image "gesiscss/binder-r2d-g5b5b759-giswqs-2dwhitebox-2drust-2dbinder-fd7fcc:5f82299166f6e41bbf13dafa18c91cf78cc66dd9" in 1m57.875s (1m57.875s including waiting)
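The waiting time for every pod can be derived from the "Pulled" events above instead of reading it off by hand: it is the duration "including waiting" minus the pull duration. The following is a minimal sketch, assuming the event messages keep the exact wording shown in this log. The helper names are hypothetical, and the two sample messages are copied from the events above.

```python
import re

# Seconds per unit for the duration notation used in the events ("16m52.737s", "933ms").
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}
# Two-letter units are listed first so that "ms" is not read as "m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|ns|h|m|s)")
_PULLED = re.compile(
    r'Successfully pulled image "(?P<image>[^"]+)" '
    r"in (?P<pull>\S+) \((?P<total>\S+) including waiting\)"
)


def parse_duration(text):
    """Convert a duration such as '16m52.737s' or '933ms' to seconds."""
    position = 0
    seconds = 0.0
    for part in _PART.finditer(text):
        if part.start() != position:
            raise ValueError(f"unexpected duration format: {text!r}")
        seconds += float(part.group(1)) * _UNIT_SECONDS[part.group(2)]
        position = part.end()
    if position == 0 or position != len(text):
        raise ValueError(f"unexpected duration format: {text!r}")
    return seconds


def waiting_times(lines):
    """Return (image, pull_seconds, waiting_seconds) for each 'Pulled' event line."""
    rows = []
    for line in lines:
        match = _PULLED.search(line)
        if match is None:
            continue
        pull = parse_duration(match["pull"])
        total = parse_duration(match["total"])
        rows.append((match["image"], pull, total - pull))
    return rows


sample_events = [
    'Successfully pulled image "gesiscss/binder-r2d-g5b5b759-dgothrek-2dipyaggrid-aa11bb:94f3d74bce0a0a0434248824f1071998c9c105fa" in 11.419s (11.42s including waiting)',
    'Successfully pulled image "gesiscss/binder-r2d-g5b5b759-eichstaedtptb-2dmontecarlohandson-109259:6693feae52e23a19987787c7cc3761e258b35be1" in 3.941s (16m52.737s including waiting)',
]

for image, pull, waiting in waiting_times(sample_events):
    print(f"{image}: pull {pull:.3f}s, waiting {waiting:.3f}s")
```

For the second sample event this gives roughly 1009 seconds of waiting against a pull of under 4 seconds, which matches the observation that the download itself is fast and the time is lost before it starts.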

@arnim
Contributor

arnim commented Sep 2, 2024

Thank you @rgaiacs for that very detailed description. We are closing in on the problem :)
