External Metrics support for VPA recommender #5576

Merged
merged 1 commit into from
Sep 4, 2023

Contributor

@lallydd lallydd commented Mar 7, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This adds a mode to the VPA recommender to use External Metrics. It also adds local KIND (Kubernetes IN Docker) support for the recommender in both metrics-server and External Metrics modes, which accounts for most of the changed lines.

Extending the local testing to cover the rest of the VPA tests is certainly possible, and probably 95% done, but this PR was already taking a long time to get sorted.

Which issue(s) this PR fixes:

Fixes #5153

Special notes for your reviewer:

Does this PR introduce a user-facing change?

ALPHA. Added External Metrics server support. To enable this mode and disable the VPA recommender's use
of `metrics-server`, pass `--use-external-metrics=true`, and identify the metrics to use for CPU and memory
recommendations with `--external-metrics-cpu-metric=YOURMETRIC` and
`--external-metrics-memory-metric=YOURMETRIC`, respectively. You must use all three flags together to
configure the VPA in this way. This is new functionality and is not as mature or as well-tested as the
existing `metrics-server` functionality.

Will require additional permissions to use:
- apiGroups:
  - "external.metrics.k8s.io"
  resources:
  - "*"
  verbs:
  - get
  - list
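
As a rough illustration of the note above, the three flags could be assembled in a deployment script like this. The helper name `external_metrics_args` and the metric names in the usage line are hypothetical; only the three flag names come from this PR.

```bash
# Sketch: build the recommender arguments for external-metrics mode.
# Hypothetical helper; only the flag names are taken from this PR.
external_metrics_args() {
  cpu_metric="$1"
  memory_metric="$2"
  # All three flags must be used together, so refuse to emit a partial set.
  if [ -z "$cpu_metric" ] || [ -z "$memory_metric" ]; then
    echo "both a CPU and a memory metric name are required" >&2
    return 1
  fi
  printf '%s\n' \
    "--use-external-metrics=true" \
    "--external-metrics-cpu-metric=${cpu_metric}" \
    "--external-metrics-memory-metric=${memory_metric}"
}

# Example with placeholder metric names:
external_metrics_args "example_cpu_metric" "example_memory_metric"
```

Failing early when a metric name is missing mirrors the requirement that the flags are only meaningful as a set.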

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 7, 2023
@k8s-ci-robot
Contributor

Welcome @lallydd!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/vertical-pod-autoscaler labels Mar 7, 2023
@k8s-triage-robot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

@voelzmo
Contributor

voelzmo commented Mar 9, 2023

Seems like there are some issues with easyCLA: communitybridge/easycla#3843
I contacted sig-contribex about this: https://kubernetes.slack.com/archives/C1TU9EB9S/p1678351853346579

@MadhavJivrajani

/close
Going to try closing and re-opening the PR to try to get CLA to refresh

@k8s-ci-robot
Contributor

@MadhavJivrajani: Closed this PR.

@MadhavJivrajani

/reopen
fingers crossed

@k8s-ci-robot k8s-ci-robot reopened this Mar 9, 2023
@lallydd lallydd force-pushed the lally/ext-upstream branch from 6aeaef8 to 379b950 Compare July 5, 2023 18:43
@lallydd lallydd requested a review from jbartosik July 5, 2023 19:29
@lallydd
Contributor Author

lallydd commented Jul 5, 2023

Tests pass after another rebase.

Contributor

@voelzmo voelzmo left a comment

In the current state, the local e2e tests fail with this in the recommender logs

Cannot update VPA vertical-pod-autoscaling-3092/hamster-vpa object. Reason: verticalpodautoscalers.autoscaling.k8s.io "hamster-vpa" is forbidden: User "system:serviceaccount:kube-system:vpa-recommender" cannot patch resource "verticalpodautoscalers/status" in API group "autoscaling.k8s.io" in the namespace "vertical-pod-autoscaling-3092"

Now that #5911 has been merged, the permissions in hack/e2e/vpa-rbac.yaml need to be adapted as well.

This is a bit brittle, people will have to remember to adapt permissions in two places, because it is duplicated. Not sure if there's a good alternative, though.

On a more general note: I really like adding the ability to run the e2e tests locally in a kind cluster! If we get all tests to run in this scenario, this would be very helpful to everyone developing the VPA! I just really wish you would have PR'ed this separately. Right now this is a big chunk with different concerns and hard to review/approve (at least for me, it is).

@lallydd
Contributor Author

lallydd commented Jul 6, 2023

That's a really good point. I'm testing a diff/patch based approach now, where deploy-for-e2e-locally.sh generates hack/e2e/vpa-rbac.yaml from deploy/vpa-rbac.yaml with a context diff.
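
The general technique can be sketched as follows. This is an illustration under stated assumptions, not the contents of deploy-for-e2e-locally.sh: the helper name `generate_from_base` and the example file names are hypothetical, and it relies on the standard `diff -c` and `patch` tools being installed.

```bash
# Sketch: derive a variant file from a base file plus a stored context diff,
# so the base file stays the single source of truth.
# Usage: generate_from_base BASE DIFF OUT   (hypothetical helper)
generate_from_base() {
  base="$1"
  context_diff="$2"
  out="$3"
  # -o writes the patched result to OUT and leaves BASE untouched;
  # -s keeps patch quiet unless something goes wrong.
  patch -s -o "$out" -i "$context_diff" "$base"
}

# One-time step for a maintainer: record the delta as a context diff.
# diff exits with status 1 when the files differ, which is expected here.
#   diff -c base.yaml variant.yaml > variant.diff || [ $? -eq 1 ]
# At deploy time: regenerate the variant from the current base.
#   generate_from_base base.yaml variant.diff generated.yaml
```

Because a context diff carries surrounding lines, `patch` can still apply it after unrelated edits to the base file, and it fails loudly when the surrounding context no longer matches.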

@lallydd lallydd force-pushed the lally/ext-upstream branch from 379b950 to 68c9db8 Compare July 6, 2023 18:15
@lallydd
Contributor Author

lallydd commented Jul 6, 2023

@voelzmo It worked! Now this is just the diff necessary: 68c9db8#diff-debb7e710c67f3afe14c9605ad0d7ab3b59f6272a15e21d55679c27b139d732d

And that's all just to enable external metrics, so it's a useful artifact to maintain generally.

@lallydd
Contributor Author

lallydd commented Jul 6, 2023

Also added permissions note to the release-notes PR header. Please LMK if that's not the right place for it.

Collaborator

@jbartosik jbartosik left a comment

I took a look at parts of the PR other than the core code change. Mostly they look good but I'm asking for some changes (should be small) or explanations.

Thank you!

kubectl apply -f ${SCRIPT_ROOT}/hack/e2e/k8s-metrics-server.yaml

for i in ${COMPONENTS}; do
if [ $i == admission-controller ] ; then
Collaborator

If I understand correctly, this should never happen. The check on line 43 ensures that COMPONENTS is recommender or recommender-externalmetrics.

Contributor Author

Removed.

echo "<suite> should be one of:"
echo " - recommender"
echo " - recommender-externalmetrics"
echo "If component is not specified all above will be started."
Collaborator

This doesn't seem to be true. If no component is specified, the script prints help and exits.

Contributor Author

Removed.
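
For reference, the behaviour described in this thread (accept only the two supported suites, otherwise print help and fail) might look like the sketch below. The function names `usage` and `validate_suite` are hypothetical and are not taken from the script under review.

```bash
# Print the supported suites; the wording follows the help text quoted above.
usage() {
  echo "Usage: $0 <suite>"
  echo "<suite> should be one of:"
  echo " - recommender"
  echo " - recommender-externalmetrics"
}

# Succeeds only for a supported suite. Otherwise prints help to stderr and
# returns non-zero so the caller can exit.
validate_suite() {
  case "${1:-}" in
    recommender|recommender-externalmetrics) return 0 ;;
    *) usage >&2; return 1 ;;
  esac
}

# Typical call site near the top of a script:
#   validate_suite "$1" || exit 1
```

With a single check like this up front, later branches for components that can never be selected (such as admission-controller) become dead code and can be dropped.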

Comment on lines 54 to 55
export REGISTRY=localhost:5001
export TAG=latest
Collaborator

I think it would be nice if the script allowed setting REGISTRY and TAG

Contributor Author

Fixed. Now if they're in the environment, they'll get used.
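
One common way to get that behaviour in shell is default-value parameter expansion. The sketch below assumes that approach, which is not necessarily how the script in this PR does it. The helper name `set_image_defaults` is hypothetical and the defaults are the values from the diff above.

```bash
# Keep REGISTRY and TAG if the caller already exported them; otherwise fall
# back to the defaults shown in the reviewed diff.
set_image_defaults() {
  export REGISTRY="${REGISTRY:-localhost:5001}"
  export TAG="${TAG:-latest}"
}

set_image_defaults
echo "using registry ${REGISTRY} with tag ${TAG}"
```

A caller can then override either value for a single run by setting it in the environment before invoking the script, without editing the script itself.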

Comment on lines 64 to 69
for i in ${COMPONENTS}; do
if [ $i == admission-controller ] ; then
(cd ${SCRIPT_ROOT}/pkg/${i} && bash ./gencerts.sh || true)
elif [ $i == recommender-externalmetrics ] ; then
i=recommender
fi
ALL_ARCHITECTURES=amd64 make --directory ${SCRIPT_ROOT}/pkg/${i} release REGISTRY=${REGISTRY} TAG=${TAG}
kind load docker-image ${REGISTRY}/vpa-${i}-amd64:${TAG}
done
Collaborator

Suggested change
- for i in ${COMPONENTS}; do
- if [ $i == admission-controller ] ; then
- (cd ${SCRIPT_ROOT}/pkg/${i} && bash ./gencerts.sh || true)
- elif [ $i == recommender-externalmetrics ] ; then
- i=recommender
- fi
- ALL_ARCHITECTURES=amd64 make --directory ${SCRIPT_ROOT}/pkg/${i} release REGISTRY=${REGISTRY} TAG=${TAG}
- kind load docker-image ${REGISTRY}/vpa-${i}-amd64:${TAG}
- done
+ kind load docker-image ${REGISTRY}/vpa-recommender-amd64:${TAG}

Contributor Author

Are you sure you want to remove the make too?

Collaborator

No, it should stay

Collaborator

I think I've asked about this but can't find the comment right now: is this a copy of a file from another repo?

Contributor Author

Collaborator

Please explain what this diff is for

Contributor Author

@lallydd lallydd Aug 14, 2023

I'll put together a small README. In short, it's what else you need to turn on external metrics which we use for local end-to-end testing. It's automatically applied when running locally. This prevents us needing to put the permission in the normally deployed RBAC yaml -- people can put it in when they want to use this feature.

EDIT: added to existing hack/local-cluster.md doc.

Contributor Author

I've added a section "RBAC Changes" under hack/local-cluster.md to explain it.

# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
Collaborator

Comment on lines +59 to +63
if [ ${SUITE} == recommender-externalmetrics ]; then
${SCRIPT_ROOT}/hack/run-e2e-tests.sh recommender
else
${SCRIPT_ROOT}/hack/run-e2e-tests.sh ${SUITE}
fi
Collaborator

Suggested change
- if [ ${SUITE} == recommender-externalmetrics ]; then
- ${SCRIPT_ROOT}/hack/run-e2e-tests.sh recommender
- else
- ${SCRIPT_ROOT}/hack/run-e2e-tests.sh ${SUITE}
- fi
+ ${SCRIPT_ROOT}/hack/run-e2e-tests.sh recommender

Contributor Author

This would make it harder to make other components testable locally. Are you sure you want this?
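
To illustrate the trade-off: a small name-mapping helper keeps the door open for other suites, whereas hard-coding `recommender` does not. The sketch below uses a hypothetical `e2e_suite_for` function, and `updater` is only an example of a suite that might be added later. The two recommender suite names come from the diff.

```bash
# Map a local suite name to the suite name passed to hack/run-e2e-tests.sh.
# recommender-externalmetrics reuses the recommender tests; any other name is
# passed through unchanged, so a new suite needs no extra branch here.
e2e_suite_for() {
  case "$1" in
    recommender-externalmetrics) echo "recommender" ;;
    *) echo "$1" ;;
  esac
}

e2e_suite_for recommender-externalmetrics
e2e_suite_for updater
```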

@lallydd lallydd force-pushed the lally/ext-upstream branch from 68c9db8 to de708d2 Compare August 14, 2023 21:11
Contributor Author

@lallydd lallydd left a comment

I believe I got everything, except for replacing the script as we discussed in this morning's Zoom call.

# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
Contributor Author

It wouldn't be a drop-in replacement. This script generates reasonable resource-usage sizes for every pod, making the existing recommender tests work by giving them resource-use values. As discussed in-meeting, we can work to remove it (by adjusting tests or other infra) in a follow-up PR.

Collaborator

@jbartosik jbartosik left a comment

Mostly looks good to me. A few comments to clean this up.

Collaborator

Please remove changes in vertical-pod-autoscaler/e2e/go.sum and vertical-pod-autoscaler/e2e/vendor/ from the PR

Contributor Author

Done and done.

# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
Collaborator

Please open a follow up issue to make this more maintainable

Collaborator

There aren't any changes to go.mod / go.sum causing this?

Contributor Author

@lallydd lallydd Aug 21, 2023

The external_metrics api was referenced in vendor/modules.txt. I think it's just pulled in as part of the existing k8s.io/metrics reference in go.mod

Contributor Author

Collaborator

@jbartosik jbartosik left a comment

/lgtm

As discussed at the last SIG meeting we should clearly document flags we add as alpha and communicate clearly that this isn't as mature as metrics server support yet.

https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 22, 2023
@lallydd
Contributor Author

lallydd commented Aug 22, 2023

I just watched the SIG meeting. Should the flags be documented as alpha under the flag declaration in the VPA Recommender README, or somewhere else?

@lallydd lallydd force-pushed the lally/ext-upstream branch from a5148d2 to 034632a Compare August 28, 2023 14:18
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2023
@lallydd
Contributor Author

lallydd commented Aug 28, 2023

@jbartosik I added ALPHA designations to the command-line flags and added an ALPHA designation to the release notes. Also this sentence: "This is new functionality and is not as mature or as well-tested as the
existing metrics-server functionality."

Collaborator

@jbartosik jbartosik left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 4, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbartosik, lallydd

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 4, 2023
@k8s-ci-robot k8s-ci-robot merged commit e31031c into kubernetes:master Sep 4, 2023
@raywainman raywainman mentioned this pull request Oct 11, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/vertical-pod-autoscaler
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯
Development

Successfully merging this pull request may close these issues.

Ways for VPA recommender to support other sources of data.