
ROX-15980: Set resources requests and limits for fleetshard-sync #990

Merged 10 commits into main on May 2, 2023

Conversation

@ludydoo (Collaborator) commented Apr 26, 2023

Description

As part of the app-sre onboarding requirements, requests and limits must be set for all components of the ACSCS service. This PR adds resource requests and limits for the fleetshard-sync component.

It uses sensible values derived from the available metrics.

  • The maximum observed CPU usage was very low, with a 95th percentile of around 0.1 CPU.
  • The maximum observed memory usage was also very low, around 150Mi.
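A minimal sketch of how observed metrics like these can be turned into candidate values. The 2x headroom factor and the rounding rule are illustrative assumptions, not taken from this PR:

```shell
# Illustrative sizing arithmetic, assuming a 2x headroom factor (not from
# this PR): scale the observed 95th-percentile usage, then round memory up
# to the next multiple of 256Mi.
OBSERVED_CPU_95P_MILLI=100   # ~0.1 CPU observed
OBSERVED_MEM_MAX_MI=150      # ~150Mi observed

CPU_REQUEST_MILLI=$((OBSERVED_CPU_95P_MILLI * 2))
MEMORY_LIMIT_MI=$(( (OBSERVED_MEM_MAX_MI * 2 + 255) / 256 * 256 ))

echo "cpu=${CPU_REQUEST_MILLI}m memory=${MEMORY_LIMIT_MI}Mi"
```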

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security and business related topics privately if needed. Will move any security and business related topics that arise to a private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary

@ludydoo ludydoo temporarily deployed to development April 26, 2023 11:47 — with GitHub Actions Inactive
@kylape (Contributor) left a comment


Can you give a brief explanation of why you chose the particular requests and limits? My intuition leads me to believe the cpu and memory requirements for fleet manager are quite low, but it'd be nice to have that backed by data.

Also, it looks like your memory requests and limits are not powers of two. I assume they should be, right?

Review threads (outdated, resolved):
dp-terraform/helm/rhacs-terraform/README.md
dp-terraform/helm/rhacs-terraform/terraform_cluster.sh (2 threads)
@porridge (Collaborator) commented
Can you give a brief explanation of why you chose the particular requests and limits?

See PR description?

@porridge (Collaborator) left a comment


A bunch of nitpicks inline, but LGTM overall.

Also, this is not yet the straw that broke the camel's back, but I think we're now in the range where the length of the --set parameters in the helm command line is getting ridiculous. Would you mind filing a ticket (and putting its ID in a TODO somewhere in terraform_cluster.sh) to restructure this into per-environment values files? Then we could use the same base file to ensure consistency, and override some values on a per-environment basis using per-environment values files as necessary.
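A sketch of the restructuring suggested here, assuming hypothetical file names (a shared base values.yaml plus one override file per environment; neither exists in the repo as of this PR):

```shell
# Hypothetical per-environment values layout; file names are illustrative.
# The long --set list in terraform_cluster.sh would then collapse to two
# -f flags on the helm command line:
#   helm upgrade ... -f "$BASE_VALUES" -f "$ENV_VALUES"
ENVIRONMENT="dev"   # e.g. dev, stage, prod
BASE_VALUES="dp-terraform/helm/rhacs-terraform/values.yaml"
ENV_VALUES="dp-terraform/helm/rhacs-terraform/values-${ENVIRONMENT}.yaml"
echo "$ENV_VALUES"
```

Later -f flags override earlier ones in Helm, so the base file keeps the environments consistent while the per-environment file only carries the deltas.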

Review threads (outdated, resolved):
dp-terraform/helm/rhacs-terraform/README.md
dp-terraform/helm/rhacs-terraform/terraform_cluster.sh (2 threads)
@openshift-ci openshift-ci bot removed the lgtm label Apr 27, 2023
@ludydoo ludydoo temporarily deployed to development April 27, 2023 07:13 — with GitHub Actions Inactive
@kovayur (Contributor) left a comment


LGTM

openshift-ci bot commented Apr 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kovayur, ludydoo, porridge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the lgtm label Apr 27, 2023
openshift-ci bot commented Apr 27, 2023

New changes are detected. LGTM label has been removed.

@ludydoo ludydoo temporarily deployed to development April 27, 2023 07:46 — with GitHub Actions Inactive
@ludydoo (Collaborator, Author) commented Apr 27, 2023

Can you give a brief explanation of why you chose the particular requests and limits? My intuition leads me to believe the cpu and memory requirements for fleet manager are quite low, but it'd be nice to have that backed by data.

Also, it looks like your memory requests and limits are not powers of two. I assume they should be, right?

Yes, I checked the current fleetshard-sync memory and CPU usage in Prometheus. The actual current resource usage is very low: it does not really go north of 128Mi or 0.1 CPU. Still, I assumed that for a production workload we should give it a bit more oomph, just to be on the safe side and avoid running into resource problems, since it's a pretty important part of the puzzle.

@ludydoo (Collaborator, Author) commented Apr 27, 2023

/retest

@ludydoo ludydoo temporarily deployed to development April 28, 2023 12:11 — with GitHub Actions Inactive
@kylape (Contributor) commented May 1, 2023

See PR description?

🤦‍♂️ not sure how I missed this. Thanks for restating your reasoning, @ludydoo

@@ -56,6 +64,10 @@ case $ENVIRONMENT in
OBSERVABILITY_OPERATOR_VERSION="v4.0.4"
OPERATOR_USE_UPSTREAM="true"
OPERATOR_VERSION="v4.0.0"
FLEETSHARD_SYNC_CPU_REQUEST="${FLEETSHARD_SYNC_CPU_REQUEST:-"1000m"}"
Contributor commented on this line:

IMO setting requests to limits for cpu is not as desirable as it is for memory. The impact of cpu overcommit (slowness) is generally not as severe as memory (evictions), and setting cpu requests too high can more easily lead to scheduling issues when the cpu usage on the cluster is relatively low.

Collaborator (Author) replied:

I've updated the CPU request to be lower, but kept the memory requests == limits
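The agreed-on pattern can be sketched in the terraform_cluster.sh style shown in the diff above; the concrete values here are illustrative, not the ones merged:

```shell
# Illustrative values (not the merged ones): CPU request kept well below the
# limit, since CPU overcommit only slows the pod down, while the memory
# request equals the limit to avoid eviction under node memory pressure.
FLEETSHARD_SYNC_CPU_REQUEST="200m"
FLEETSHARD_SYNC_CPU_LIMIT="1000m"
FLEETSHARD_SYNC_MEMORY_REQUEST="512Mi"
FLEETSHARD_SYNC_MEMORY_LIMIT="512Mi"
echo "cpu ${FLEETSHARD_SYNC_CPU_REQUEST}/${FLEETSHARD_SYNC_CPU_LIMIT} mem ${FLEETSHARD_SYNC_MEMORY_REQUEST}/${FLEETSHARD_SYNC_MEMORY_LIMIT}"
```

With memory request equal to limit, the pod gets the Guaranteed-style behavior for memory while keeping its CPU footprint on the scheduler small.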

@ludydoo ludydoo temporarily deployed to development May 2, 2023 08:58 — with GitHub Actions Inactive
@ludydoo ludydoo requested a review from kylape May 2, 2023 08:59
@ludydoo ludydoo temporarily deployed to development May 2, 2023 09:00 — with GitHub Actions Inactive
@ludydoo ludydoo merged commit 1d2003e into main May 2, 2023
@ludydoo ludydoo deleted the ROX-15980-fleetshard-sync-resources-requests-limits branch May 2, 2023 10:32
5 participants