
ROX-15980: Set resources requests and limits for fleetshard-sync #990

Merged 10 commits into main on May 2, 2023

Conversation

@ludydoo (Collaborator) commented Apr 26, 2023

Description

As part of the app-sre onboarding requirements, requests and limits must be set for all components of the ACSCS service. This PR adds resource requests and limits for the fleetshard-sync component.

It uses sensible values derived from the available metrics.

  • The maximum observed CPU usage was very low, with a 95th percentile of around 0.1 CPU.
  • The maximum observed memory usage was also very low, around 150Mi.
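A minimal sketch of how observed metrics like these can be turned into candidate values. The 2x headroom factor and the rounding rule are illustrative assumptions, not taken from this PR:

```shell
# Illustrative sizing arithmetic, assuming a 2x headroom factor (not from
# this PR): scale the observed 95th-percentile usage, then round memory up
# to the next multiple of 256Mi.
OBSERVED_CPU_95P_MILLI=100   # ~0.1 CPU observed
OBSERVED_MEM_MAX_MI=150      # ~150Mi observed

CPU_REQUEST_MILLI=$((OBSERVED_CPU_95P_MILLI * 2))
MEMORY_LIMIT_MI=$(( (OBSERVED_MEM_MAX_MI * 2 + 255) / 256 * 256 ))

echo "cpu=${CPU_REQUEST_MILLI}m memory=${MEMORY_LIMIT_MI}Mi"
```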

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security and business related topics privately if needed. Will move any security and business related topics that arise to a private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary

@ludydoo ludydoo temporarily deployed to development April 26, 2023 11:47 — with GitHub Actions Inactive
@kylape (Contributor) left a comment


Can you give a brief explanation of why you chose the particular requests and limits? My intuition leads me to believe the cpu and memory requirements for fleet manager are quite low, but it'd be nice to have that backed by data.

Also, it looks like your memory requests and limits are not powers of two. I assume they should be, right?

Review threads (outdated, resolved):
dp-terraform/helm/rhacs-terraform/README.md
dp-terraform/helm/rhacs-terraform/terraform_cluster.sh (2 threads)
@porridge (Collaborator) commented
Can you give a brief explanation of why you chose the particular requests and limits?

See PR description?

@porridge (Collaborator) left a comment


A bunch of nitpicks inline, but LGTM overall.

Also, this is not yet the straw that broke the camel's back, but I think we're now in the range where the length of the --set parameters in the helm command line is getting ridiculous. Would you mind filing a ticket (and putting its ID in a TODO somewhere in terraform_cluster.sh) to restructure this into per-environment values files? Then we could use the same base file to ensure consistency, and override some values on a per-environment basis using per-environment values files as necessary.
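A sketch of the restructuring suggested here, assuming hypothetical file names (a shared base values.yaml plus one override file per environment; neither exists in the repo as of this PR):

```shell
# Hypothetical per-environment values layout; file names are illustrative.
# The long --set list in terraform_cluster.sh would then collapse to two
# -f flags on the helm command line:
#   helm upgrade ... -f "$BASE_VALUES" -f "$ENV_VALUES"
ENVIRONMENT="dev"   # e.g. dev, stage, prod
BASE_VALUES="dp-terraform/helm/rhacs-terraform/values.yaml"
ENV_VALUES="dp-terraform/helm/rhacs-terraform/values-${ENVIRONMENT}.yaml"
echo "$ENV_VALUES"
```

Later -f flags override earlier ones in Helm, so the base file keeps the environments consistent while the per-environment file only carries the deltas.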

Review threads (outdated, resolved):
dp-terraform/helm/rhacs-terraform/README.md
dp-terraform/helm/rhacs-terraform/terraform_cluster.sh (2 threads)
@openshift-ci openshift-ci bot removed the lgtm label Apr 27, 2023
@ludydoo ludydoo temporarily deployed to development April 27, 2023 07:13 — with GitHub Actions Inactive
@kovayur (Contributor) left a comment


LGTM

openshift-ci bot commented Apr 27, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kovayur, ludydoo, porridge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the lgtm label Apr 27, 2023
openshift-ci bot commented Apr 27, 2023

New changes are detected. LGTM label has been removed.

@ludydoo ludydoo temporarily deployed to development April 27, 2023 07:46 — with GitHub Actions Inactive
@ludydoo (Collaborator, Author) commented Apr 27, 2023

Can you give a brief explanation of why you chose the particular requests and limits? My intuition leads me to believe the cpu and memory requirements for fleet manager are quite low, but it'd be nice to have that backed by data.

Also, it looks like your memory requests and limits are not powers of two. I assume they should be, right?

Yes, I checked the current fleetshard-sync memory and CPU usage in Prometheus. The actual current resource usage is very low: it does not really go north of 128Mi or 0.1 CPU. Still, I assumed that for a production workload we should give it a bit more oomph, just to be on the safe side and avoid running into resource problems, since it's a pretty important part of the puzzle.

@ludydoo (Collaborator, Author) commented Apr 27, 2023

/retest

@ludydoo ludydoo temporarily deployed to development April 28, 2023 12:11 — with GitHub Actions Inactive
@kylape (Contributor) commented May 1, 2023

See PR description?

🤦‍♂️ not sure how I missed this. Thanks for restating your reasoning, @ludydoo

@@ -56,6 +64,10 @@ case $ENVIRONMENT in
OBSERVABILITY_OPERATOR_VERSION="v4.0.4"
OPERATOR_USE_UPSTREAM="true"
OPERATOR_VERSION="v4.0.0"
FLEETSHARD_SYNC_CPU_REQUEST="${FLEETSHARD_SYNC_CPU_REQUEST:-"1000m"}"
Contributor commented on this line:

IMO setting requests to limits for cpu is not as desirable as it is for memory. The impact of cpu overcommit (slowness) is generally not as severe as memory (evictions), and setting cpu requests too high can more easily lead to scheduling issues when the cpu usage on the cluster is relatively low.

Collaborator (Author) replied:

I've updated the CPU request to be lower, but kept the memory requests == limits
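The agreed-on pattern can be sketched in the terraform_cluster.sh style shown in the diff above; the concrete values here are illustrative, not the ones merged:

```shell
# Illustrative values (not the merged ones): CPU request kept well below the
# limit, since CPU overcommit only slows the pod down, while the memory
# request equals the limit to avoid eviction under node memory pressure.
FLEETSHARD_SYNC_CPU_REQUEST="200m"
FLEETSHARD_SYNC_CPU_LIMIT="1000m"
FLEETSHARD_SYNC_MEMORY_REQUEST="512Mi"
FLEETSHARD_SYNC_MEMORY_LIMIT="512Mi"
echo "cpu ${FLEETSHARD_SYNC_CPU_REQUEST}/${FLEETSHARD_SYNC_CPU_LIMIT} mem ${FLEETSHARD_SYNC_MEMORY_REQUEST}/${FLEETSHARD_SYNC_MEMORY_LIMIT}"
```

With memory request equal to limit, the pod gets the Guaranteed-style behavior for memory while keeping its CPU footprint on the scheduler small.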

@ludydoo ludydoo temporarily deployed to development May 2, 2023 08:58 — with GitHub Actions Inactive
@ludydoo ludydoo requested a review from kylape May 2, 2023 08:59
@ludydoo ludydoo temporarily deployed to development May 2, 2023 09:00 — with GitHub Actions Inactive
@ludydoo ludydoo merged commit 1d2003e into main May 2, 2023
@ludydoo ludydoo deleted the ROX-15980-fleetshard-sync-resources-requests-limits branch May 2, 2023 10:32
5 participants