Merge pull request kubernetes#184 from elankath/sync-upstream-v1.26.0

* Drop redundant parameter in utilization calculation
* Extract checks for scale down eligibility
* Limit amount of node utilization logging
* Increase timeout for VPA E2E. After kubernetes#5151 the e2e tests are still failing because we're still hitting the ginkgo timeout.
* Add podScaleUpDelay annotation support
* Corrected the links for Priority in k8s API and Pod Preemption in k8s.
* Restrict Updater PodLister to namespace
* Update controller-gen to latest and use go install
* Run hack/generate-crd-yaml.sh
* update owners list for cluster autoscaler azure
* Change VPA default version to 0.12.0
* Pin controller-gen to 0.9.2
* AWS ReadMe update
* Move resource limits checking to a separate package
* Allow simulator to persist changes in cluster snapshot
* Don't depend on IsNodeBeingDeleted implementation. The fact that it considers nodes as deleted only until a certain timeout is of no concern to the eligibility.Checker.
* Stop treating masters differently in scale down. This filtering was used for two purposes:
  - Excluding masters from destination candidates
  - Excluding masters from calculating cluster resources
  Excluding from destination candidates isn't useful: if pods can schedule there, they will, so removing them from the CA simulation doesn't change anything. Excluding from calculating cluster resources actually matches scale up behavior, where master nodes are treated the same way as regular nodes.
* CA - AWS - Instance List Update 2022-09-16
* fix typo
* Modifying taint removal logic on startup to consider all nodes instead of ready nodes.
* fix typo
* Update VPA compatibility for 0.12 release
* Updated the golang version for GitHub workflow.
* Create GCE CloudProvider Owners file
* Fix error formatting in GCE client. %v results in a list of numbers when a byte array is passed.
* Introduce NodeDeleterBatcher to ScaleDown actuator
* handle directx nodes the same as gpu nodes
* magnum: add an option to create insecure TLS connections. We use self-signed certificates in OpenStack for test purposes, and it is not always easy to provide a CA certificate, so we ran into the problem that the autoscaler had no option to skip certificate validation. This patch adds a new option for the magnum plugin: tls-insecure (a sketch of the underlying transport change follows the list). Signed-off-by: Anton Kurbatov <[email protected]>
* Drop unused maps
* Extract criteria for removing unneeded nodes to a separate package
* skip instances on validation error: if an instance is already being deleted/abandoned/not a member, just continue
* cleanup unused constants in clusterapi provider. This change removes some unused values and adjusts the names in the unit tests to better reflect usage.
* Update the example spec of civo cloudprovider. Signed-off-by: Vishal Anarse <[email protected]>
* Fix race condition in scale down test
* Clean up stale OWNERS
* add example for multiple recommenders
* Balancer KEP
* Add VPA E2E for recommendation not exactly matching pod. Containers in the recommendation can differ from the containers in the pod:
  - A new container can be added to a pod; at first there will be no recommendation for it.
  - A container can be removed from a pod; for some time the recommendation will still contain the old container.
  - A container can be renamed; then there will be a recommendation for the container under its old name.
  Add tests for what VPA does in those situations.
* Add VPA E2E for recommendation not exactly matching pod with limit range. Containers in the recommendation can differ from the containers in the pod:
  - A new container can be added to a pod; at first there will be no recommendation for it.
  - A container can be removed from a pod; for some time the recommendation will still contain the old container.
  - A container can be renamed; then there will be a recommendation for the container under its old name.
  Add tests for what VPA does in those situations when a limit range exists.
* Remove units for default boot disk size
* Fix accessing index out of bounds. The function should match containers to their recommendations directly instead of hoping their order will match (a sketch follows the list); see [this comment](kubernetes#3966 (comment)).
* [vpa] introduce recommendation post processor
* Fixed gofmt error.
* Don't break scale up with priority expander config
* added replicas count for daemonsets to prevent massive pod eviction. Signed-off-by: Denis Romanenko <[email protected]>
* code review: move flag to boolean for post processor
* Add support for extended resource definition in GCE MIG template. This commit adds the possibility to define extended resources for a node group on GCE, so that the cluster-autoscaler can account for them when taking scaling decisions. This is done through the `extended_resources` key inside the AUTOSCALER_ENV_VARS variable set on a MIG template (a parsing sketch follows the list). Signed-off-by: Mayeul Blanzat <[email protected]>
* Make expander factory logic more pluggable
* Add option to wait for a period of time after node tainting/cordoning. Node state is refreshed and checked again before deleting the node. This gives kube-scheduler time to acknowledge that the nodes' state has changed and to stop scheduling pods on them.
* remove the flag for Capping post-processor
* remove unsupported functionality from cluster-api provider. This change removes the code for the `Labels` and `Taints` interface functions of the clusterapi provider when scaling from zero. The body of these functions was added erroneously, and the Cluster API community is still deciding how these values will be exposed to the autoscaler. Also updates the tests and readme to be clearer about the usage of labels and taints when scaling from zero.
* Remove ScaleDown dependency on clusterStateRegistry
* Adding support for identifying nodes that have been deleted from the cloud provider but are still registered within Kubernetes. Avoids misidentifying not-autoscaled nodes as deleted. Simplified implementation to use apiv1.Node instead of a new struct. Expanded test cases to include not-autoscaled nodes and tracking deleted nodes over multiple updates. Adding a check to the backfill loop to confirm the cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.
* Implementing new cloud provider method for node deletion detection (kubernetes#1). Adding isNodeDeleted method to the CloudProvider interface; supports detecting whether nodes are fully deleted or are not autoscaled. Updated cloud providers to provide an initial implementation of the new method that returns ErrNotImplemented to maintain the existing taint-based deletion clusterstate calculation.
* Fixing go formatting issues with clusterstate_test
* Fixing errors due to merge on branches.
* Adjusting initial implementation of NodeExists to be consistent among cloud providers to return true and ErrNotImplemented.
* Fix list scaling group instance pages bug. Signed-off-by: jwcesign <[email protected]>
* Format log output. Signed-off-by: jwcesign <[email protected]>
* Split out code from simulator package
* Code Review: Do not return an error on malformed extended_resource + add more tests
* Malformed extended resource definition should not fail the template building function. Instead, log the error and ignore extended resources.
* Remove useless existence check
* Add tests around the extractExtendedResourcesFromKubeEnv function
* Add a test case to verify that a malformed extended resource definition does not fail the template build function. Signed-off-by: Mayeul Blanzat <[email protected]>
* huawei-cloudprovider: enable tags resolve for as. Signed-off-by: jwcesign <[email protected]>
* Magnum provider: switch UUID dependency from satori to gofrs. Addresses issue kubernetes#5218: the satori UUID package is unmaintained and has security vulnerabilities affecting the generation of random UUIDs. In the magnum cloud provider, this package was only used to check whether a string matches a UUIDv4 or not, so the vulnerability with generating UUIDs could not have been exploited (generating UUIDs is only done in the unit tests). The gofrs/uuid package is currently at version 4.0.0 in go.mod, well past the point at which it was forked and the vulnerability was fixed. It is a drop-in replacement for verifying a UUID (a sketch follows the list), and only a small change was needed in the testing code to handle a new returned error when generating a random UUID.
* change uuid dependency in cluster autoscaler kamatera provider
* Extract scheduling hints to a dedicated object. This removes the need for passing maps back and forth when doing scheduling simulations.
* Remove dead code for handling simulation errors
* Fix typo, move service accounts to RBAC
* VPA: Add missing --- to CRD manifests
* Base parallel scale down implementation
* Stop applying the beta.kubernetes.io/os and arch labels
* [CA] Register recently evicted pods in NodeDeletionTracker.
* Add KEP to introduce UpdateMode: UpscaleOnly
* Clarify prometheus use-case
* Adapt to review comments
* Adapt KEP according to review
* Add newline after header
* Rename proposal directory to fit KEP title
* Make KEP and implementation proposal consistent
* remove post-processor factory
* update test for MapToListOfRecommendedContainerResources
* Update aws OWNERS. Set all aws cloudprovider approvers as reviewers, so that aws-specific PRs can be handled without involving global CA reviewers.
* Add ScaleDown.Actuator to AutoscalingContext
* update the hyperlink of the api-conventions.md file in comments
* Support scaling up node groups to the configured min size if needed
* Fix: add missing RBAC permissions to magnum examples. Adding permissions to the ClusterRole in the example to avoid the error messages.
* make spellchecker happy
* Changing deletion logic to rely on a new helper method in ClusterStateRegistry, and removing old complicated logic. Adjust the naming of the method for cloud instance deletion from NodeExists to HasInstance.
* Fix VPA deployment. Use `kube-system` namespace for ServiceAccounts like it did before kubernetes#5268.
* Don't say that `Recreate` and `Auto` VPA modes are experimental
* Fixing go formatting issue in cloudstack cloud provider code.
* Add missing cloud providers to readme and sort alphabetically. Signed-off-by: Marcus Noble <[email protected]>
* huawei-cloudprovider: enable taints resolve for as, modify the example yaml to accelerate node scale-down. Signed-off-by: jwcesign <[email protected]>
* Update cluster-autoscaler/README.md. Co-authored-by: Guy Templeton <[email protected]>
* cluster-autoscaler: refactor BalanceScaleUpBetweenGroups
* Allow forking snapshot more than 1 time
* Fork ClusterSnapshot in UpdateClusterState
* add logging information to FAQ. This change adds a section about how to increase the logging verbosity and why you might want to do that.
* fix(cluster-autoscaler/hetzner): pre-existing volumes break scheduling. The `hcloud-csi-driver` v1.x uses the label `csi.hetzner.cloud/location` for topology. This label was not added in the response to `n.TemplateNodeInfo()`, causing cluster-autoscaler to not consider any node group for scaling when a pre-existing volume was attached to the pending pod. This is fixed by adding the appropriately named label to the `NodeInfo` (a sketch follows the list). In practice this label is added by the `hcloud-csi-driver`. In the upcoming v2 of the driver we migrated to using `apiv1.LabelZoneRegionStable` for topology constraints, but this fix is still required so customers do not have to re-create all `PersistentVolumes`. Further details on the bug are available in the original issue: hetznercloud/csi-driver#302
* Added RBAC Permission to Azure.
* Log node group min and current size when skipping scale down
* Use scheduling package in filterOutSchedulable processor
* Check owner reference in scale down planner to avoid double-counting already deleted pods.
* Add note regarding GPU label for the CAPI provider. cluster-autoscaler takes into consideration the time that a node takes to initialise a GPU resource, as long as a particular label is in place. This label differs from provider to provider, and is documented in some cases but not for CAPI. This commit adds a note with the specific label that should be applied when a node is instantiated.
* chore(cluster-autoscaler/hetzner): add myself to OWNERS file
* Use ScaleDownSetProcessor.GetNodesToRemove in scale down planner to filter NodesToDelete.
* Handle pagination when looking through supported shapes.
* Add OCI API files to handle OCI work-request operations.
* Fail fast if OCI instance pool is out of capacity/quota.
* update vendor to v1.26.0-rc.1
* fix issue 5332
* Deprecate v1beta1 API. The v1beta2 API was introduced in kubernetes#1668; it's present in VPA [0.4.0](https://github.com/kubernetes/autoscaler/tree/vertical-pod-autoscaler-0.4.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1beta2) but not in [0.3.1](https://github.com/kubernetes/autoscaler/tree/vertical-pod-autoscaler-0.3.1/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1beta2). I added comments to vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1beta2/types.go and generated changes to `vertical-pod-autoscaler/deploy/vpa-v1-crd-gen.yaml` with `vertical-pod-autoscaler/hack/generate-crd-yaml.sh`.
* Add note about `v1beta2` deprecation to README
* fix issue 5332 - adding suggested change
* Break node categorization in scale down planner on timeout.
* Automatically label cluster-autoscaler PRs
* Add missing dot
* fix generate ec2 instance types
* Introduce a formal policy for maintaining cloudproviders. The policy largely codifies what we've already been doing for years (including the requirements we've already imposed on new providers).
* Introduce Cloudprovider Maintenance Request to policy
* feat(helm): add rancher cloud config support. Autoscaler 1.25.0 adds "rancher" cloud provider support, which requires setting cloudConfigPath. If the user mounts this as a secret and sets this value appropriately, this change sets the argument required to point to the mounted secret. Previously, this was only set if the cloud provider was magnum or aws.
* Updating error messaging and fallback behavior of hasCloudProviderInstance. Changing deletedNodes to store an empty struct instead of node values, and modifying the helper function to utilize that information for tests.
* Fixing helper function to simplify the for loop that retrieves deleted node names.
* Use PdbRemainingDisruptions in Planner
* Put risky NodeToRemove at the end of the needDrain list
* Auto Label Helm Chart PRs
* psp_api
* Create a Planner object if --parallelDrain=true
* Export execution_latency_seconds metric from VPA admission controller. Sometimes I see admissions that are slower than the rest. Logs indicate that `AdmissionServer.admit` doesn't get slow (it's the only part with logging). I'd like to have a metric which will tell us what's slow so that we can maybe improve it (a sketch follows the list).
* aws: add nodegroup name to default labels
* Fix int formatting in threshold_based_limiter logs
* rancher-cloudprovider: Improve node group discovery. Previously the rancher provider tried to parse the node `spec.providerID` to extract the node group name. Instead, we now get the machines by the node name and then use a rancher-specific label that should always be on the machine. This should work more reliably for all the different node drivers that rancher supports. Signed-off-by: Cyrill Troxler <[email protected]>
* Don't add pods from drained nodes in scale-down
* Add default PodListProcessor wrapper
* Add currently drained pods before scale-up
* set cluster_autoscaler_max_nodes_count dynamically. Signed-off-by: yasin.lachiny <[email protected]>
* fix(helm): bump chart ver -> 9.21.1
* CA - AWS - Update Hardcoded Instance Details List to 11-12-2022
* Add x13n to cluster autoscaler approvers
* update prometheus metric min maxNodesCount and a.MaxNodesTotal. Signed-off-by: yasin.lachiny <[email protected]>
* CA - AWS - Update Docs all actions IAM policy
* Cluster Autoscaler: update vendor to k8s v1.26.0
* removed dotimports from framework.go
* fixed another dotimport
* add missing vpa vendor,e2e/vendor to sync branch
* removed old files from vpa vendor to fix test
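For the magnum `tls-insecure` item in the list above: the commit text doesn't show the wiring, so here is a minimal Go sketch of what such an option typically toggles, namely `InsecureSkipVerify` on the client's TLS config. The `newHTTPClient` helper and the `tlsInsecure` parameter are hypothetical names for illustration, not the provider's actual code.

```go
package sketch

import (
	"crypto/tls"
	"net/http"
)

// newHTTPClient returns an http.Client that optionally skips TLS
// certificate verification, the usual mechanism behind a tls-insecure
// style option. (Hypothetical helper; the magnum provider's real
// plumbing goes through its OpenStack client configuration.)
func newHTTPClient(tlsInsecure bool) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// InsecureSkipVerify disables validation of the server
			// certificate chain and host name. Only reasonable for
			// test environments with self-signed certificates.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: tlsInsecure},
		},
	}
}
```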
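For the index-out-of-bounds fix above: the point is to look recommendations up by container name rather than by position. A minimal sketch with a stand-in `ContainerRecommendation` type and a hypothetical `recommendationFor` helper; the real types live in the VPA API package.

```go
package sketch

// ContainerRecommendation is an illustrative stand-in for a VPA
// recommendation entry; the real type carries resource amounts as well.
type ContainerRecommendation struct {
	ContainerName string
}

// recommendationFor finds a container's recommendation by name instead of
// assuming recs[i] belongs to the i-th container of the pod, which breaks
// as soon as containers are added, removed, or renamed.
func recommendationFor(recs []ContainerRecommendation, containerName string) (ContainerRecommendation, bool) {
	for _, r := range recs {
		if r.ContainerName == containerName {
			return r, true
		}
	}
	return ContainerRecommendation{}, false
}
```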
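For the GCE `extended_resources` item above: the extended resources value inside AUTOSCALER_ENV_VARS has to be split into named quantities. A sketch of the parsing, assuming entries of the form `example.com/dongle=2,nvidia.com/gpu=1`; `parseExtendedResources` is a hypothetical name (the provider's real entry point is `extractExtendedResourcesFromKubeEnv`, mentioned in the review item above), and per that review, malformed entries are logged and skipped rather than failing template building.

```go
package sketch

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

// parseExtendedResources turns "name=quantity,name=quantity" into named
// quantities. Malformed entries are collected as errors and skipped, so a
// bad definition never fails the whole template build.
func parseExtendedResources(value string) (map[string]resource.Quantity, []error) {
	out := map[string]resource.Quantity{}
	var errs []error
	for _, pair := range strings.Split(value, ",") {
		name, qty, ok := strings.Cut(pair, "=")
		if !ok {
			errs = append(errs, fmt.Errorf("malformed extended resource %q", pair))
			continue
		}
		q, err := resource.ParseQuantity(qty)
		if err != nil {
			errs = append(errs, fmt.Errorf("bad quantity for %q: %v", name, err))
			continue
		}
		out[name] = q
	}
	return out, errs
}
```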
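For the satori-to-gofrs switch above: verifying a UUIDv4 with gofrs/uuid is a one-liner, which is why the swap is a drop-in for the magnum provider's only use of the package. `isUUIDv4` is an illustrative helper name.

```go
package sketch

import "github.com/gofrs/uuid"

// isUUIDv4 reports whether s parses as a valid version-4 UUID, the only
// check the magnum provider needed from the old satori package.
func isUUIDv4(s string) bool {
	u, err := uuid.FromString(s)
	return err == nil && u.Version() == uuid.V4
}
```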
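For the hetzner topology fix above: the template node returned by `TemplateNodeInfo()` has to carry the same `csi.hetzner.cloud/location` label that the hcloud-csi-driver v1.x sets on real nodes, otherwise the scheduling simulation rejects every node group for pods bound to pre-existing volumes. A sketch with a hypothetical `addCSITopologyLabel` helper, not the provider's actual code.

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// addCSITopologyLabel puts the hcloud-csi-driver v1.x topology label on a
// template node so that volume topology constraints can be satisfied in
// the cluster-autoscaler's scheduling simulation.
func addCSITopologyLabel(node *apiv1.Node, location string) {
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels["csi.hetzner.cloud/location"] = location
}
```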
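For the `execution_latency_seconds` metric above: a sketch of what such a histogram looks like with Prometheus client_golang, partitioned by a `step` label so slow admissions can be attributed to a specific phase. The namespace, label name, and buckets here are assumptions, not the VPA admission controller's exact definition.

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// executionLatency partitions admission latency by processing step.
var executionLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "vpa_admission_controller", // assumed namespace
		Name:      "execution_latency_seconds",
		Help:      "Time spent in parts of the admission flow.",
		Buckets:   prometheus.DefBuckets,
	},
	[]string{"step"},
)

func init() { prometheus.MustRegister(executionLatency) }

// observeStep records how long a single admission step took, e.g.:
//   defer observeStep("recommend", time.Now())
func observeStep(step string, start time.Time) {
	executionLatency.WithLabelValues(step).Observe(time.Since(start).Seconds())
}
```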
--------

Signed-off-by: Anton Kurbatov <[email protected]>
Signed-off-by: Vishal Anarse <[email protected]>
Signed-off-by: Denis Romanenko <[email protected]>
Signed-off-by: Mayeul Blanzat <[email protected]>
Signed-off-by: jwcesign <[email protected]>
Signed-off-by: Marcus Noble <[email protected]>
Signed-off-by: Cyrill Troxler <[email protected]>
Signed-off-by: yasin.lachiny <[email protected]>
Co-authored-by: Daniel Kłobuszewski <[email protected]>
Co-authored-by: Kubernetes Prow Robot <[email protected]>
Co-authored-by: Joachim Bartosik <[email protected]>
Co-authored-by: Damir Markovic <[email protected]>
Co-authored-by: Shubham Kuchhal <[email protected]>
Co-authored-by: Marco Voelz <[email protected]>
Co-authored-by: Prachi Gandhi <[email protected]>
Co-authored-by: bdobay <[email protected]>
Co-authored-by: Juan Borda <[email protected]>
Co-authored-by: Fabio Berchtold <[email protected]>
Co-authored-by: Clint Fooken <[email protected]>
Co-authored-by: Jayant Jain <[email protected]>
Co-authored-by: Yaroslava Serdiuk <[email protected]>
Co-authored-by: Flavian <[email protected]>
Co-authored-by: Anton Kurbatov <[email protected]>
Co-authored-by: Fulton Byrne <[email protected]>
Co-authored-by: Michael McCune <[email protected]>
Co-authored-by: Vishal Anarse <[email protected]>
Co-authored-by: Matthias Bertschy <[email protected]>
Co-authored-by: Marcin Wielgus <[email protected]>
Co-authored-by: David Benque <[email protected]>
Co-authored-by: Denis Romanenko <[email protected]>
Co-authored-by: Mayeul Blanzat <[email protected]>
Co-authored-by: Alexandru Matei <[email protected]>
Co-authored-by: Clint <[email protected]>
Co-authored-by: jwcesign <[email protected]>
Co-authored-by: Thomas Hartland <[email protected]>
Co-authored-by: Ori Hoch <[email protected]>
Co-authored-by: Joel Smith <[email protected]>
Co-authored-by: Paco Xu <[email protected]>
Co-authored-by: Aleksandra Gacek <[email protected]>
Co-authored-by: Marco Voelz <[email protected]>
Co-authored-by: Bartłomiej Wróblewski <[email protected]>
Co-authored-by: hangcui <[email protected]>
Co-authored-by: Xintong Liu <[email protected]>
Co-authored-by: GanjMonk <[email protected]>
Co-authored-by: Marcus Noble <[email protected]>
Co-authored-by: Marcus Noble <[email protected]>
Co-authored-by: Guy Templeton <[email protected]>
Co-authored-by: Michael Grosser <[email protected]>
Co-authored-by: Julian Tölle <[email protected]>
Co-authored-by: Nick Jones <[email protected]>
Co-authored-by: jesse.millan <[email protected]>
Co-authored-by: Jordan Liggitt <[email protected]>
Co-authored-by: McGonigle, Neil <[email protected]>
Co-authored-by: Anton Khizunov <[email protected]>
Co-authored-by: Maciek Pytel <[email protected]>
Co-authored-by: Basit Mustafa <[email protected]>
Co-authored-by: xval2307 <[email protected]>
Co-authored-by: yznima <[email protected]>
Co-authored-by: Cyrill Troxler <[email protected]>
Co-authored-by: yasin.lachiny <[email protected]>
Co-authored-by: Kuba Tużnik <[email protected]>