Sync with upstream v1.21.3 (#129)

* Set maxAsgNamesPerDescribe to the new maximum value While this was previously effectively limited to 50, `DescribeAutoScalingGroups` now supports fetching 100 ASG per calls on all regions, matching what's documented: https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_DescribeAutoScalingGroups.html ``` AutoScalingGroupNames.member.N The names of the Auto Scaling groups. By default, you can only specify up to 50 names. You can optionally increase this limit using the MaxRecords parameter. MaxRecords The maximum number of items to return with this call. The default value is 50 and the maximum value is 100. ``` Doubling this halves API calls on large clusters, which should help to prevent throttling. * Break out unmarshal from GenerateEC2InstanceTypes Refactor to allow for optimisation * Optimise GenerateEC2InstanceTypes unmarshal memory usage The pricing json for us-east-1 is currently 129MB. Currently fetching this into memory and parsing results in a large memory footprint on startup, and can lead to the autoscaler being OOMKilled. Change the ReadAll/Unmarshal logic to a stream decoder to significantly reduce the memory use. * use aws sdk to find region * Merge pull request kubernetes#4274 from kinvolk/imran/cloud-provider-packet-fix Cloud provider[Packet] fixes * Fix templated nodeinfo names collisions in BinpackingNodeEstimator Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now uses the same shared DeepCopyTemplateNode function and inherits its naming pattern, which is great as that fixes a long standing bug. Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with generated nodeinfos and nodes having predictable names (using template name + an incremental ordinal starting at 0) for upcoming nodes. Later, when it looks for fitting nodes for unschedulable pods (when upcoming nodes don't satisfy those (FitsAnyNodeMatching failing due to nodes capacity, or pods antiaffinity, ...), the binpacking estimator will also build virtual nodes and place them in a snapshot fork to evaluate scheduler predicates. Those temporary virtual nodes are built using the same pattern (template name and an index ordinal also starting at 0) as the one previously used by `getUpcomingNodeInfos`, which means it will generate the same nodeinfos/nodes names for nodegroups having upcoming nodes. But adding nodes by the same name in an existing cluster snapshot isn't allowed, and the evaluation attempt will fail. Practically this blocks re-upscales for nodegroups having upcoming nodes, which can cause a significant delay. * Improve misleading log Signed-off-by: Sylvain Rabot <[email protected]> * dont proactively decrement azure cache for unregistered nodes * annotate fakeNodes so that cloudprovider implementations can identify them if needed * move annotations to cloudprovider package * Cluster Autoscaler 1.21.1 * CA - AWS - Instance List Update 03-10-21 - 1.21 release branch * CA - AWS - Instance List Update 29-10-21 - 1.21 release branch * Cluster-Autoscaler update AWS EC2 instance types with g5, m6 and r6 * CA - AWS Instance List Update - 13/12/21 - 1.21 * Merge pull request kubernetes#4497 from marwanad/add-more-azure-instance-types add more azure instance types * Cluster Autoscaler 1.21.2 * Add `--feature-gates` flag to support scale up on volume limits (CSI migration enabled) Signed-off-by: ialidzhikov <[email protected]> * [Cherry pick 1.21] Remove TestDeleteBlob UT Signed-off-by: Zhecheng Li <[email protected]> * cherry-pick kubernetes#4022 [cluster-autoscaler] Publish node group min/max metrics * Skipping metrics tests added in kubernetes#4022 Each test works in isolation, but they cause panic when the entire suite is run (ex. make test-in-docker), because the underlying metrics library panics when the same metric is registered twice. (cherry picked from commit 52392b3) * cherry-pick kubernetes#4162 and kubernetes#4172 [cluster-autoscaler]Add flag to control DaemonSet eviction on non-empty nodes & Allow DaemonSet pods to opt in/out from eviction. * CA - AWS Cloud Provider - 1.21 Static Instance List Update 02-06-2022 * fix instance type fallback Instead of logging a fatal error, log a standard error and fall back to loading instance types from the static list. * Cluster Autoscaler - 1.21.3 release * FAQ updated * Sync_changes file updated Co-authored-by: Benjamin Pineau <[email protected]> Co-authored-by: Adrian Lai <[email protected]> Co-authored-by: darkpssngr <[email protected]> Co-authored-by: Kubernetes Prow Robot <[email protected]> Co-authored-by: Sylvain Rabot <[email protected]> Co-authored-by: Marwan Ahmed <[email protected]> Co-authored-by: Jakub Tużnik <[email protected]> Co-authored-by: GuyTempleton <[email protected]> Co-authored-by: sturman <[email protected]> Co-authored-by: Maciek Pytel <[email protected]> Co-authored-by: ialidzhikov <[email protected]> Co-authored-by: Zhecheng Li <[email protected]> Co-authored-by: Shubham Kuchhal <[email protected]> Co-authored-by: Todd Neal <[email protected]>
gardener · Jun 25, 2022 · 1b0b9f5 · 1b0b9f5
1 parent f74568e
commit 1b0b9f5
Show file tree

Hide file tree

Showing 32 changed files with 3,545 additions and 472 deletions.
diff --git a/cluster-autoscaler/FAQ.md b/cluster-autoscaler/FAQ.md
diff --git a/cluster-autoscaler/SYNC-CHANGES/SYNC-CHANGES-1.21.md b/cluster-autoscaler/SYNC-CHANGES/SYNC-CHANGES-1.21.md
@@ -7,6 +7,13 @@
         - [During merging](#during-merging)
         - [During vendoring k8s](#during-vendoring-k8s)
         - [Others](#others)
+- [v1.21.1](#v1211)
+    - [Synced with which upstream CA](#synced-with-which-upstream-ca-1)
+    - [Changes made](#changes-made-1)
+        - [To FAQ](#to-faq-1)
+        - [During merging](#during-merging-1)
+        - [During vendoring k8s](#during-vendoring-k8s-1)
+        - [Others](#others-1)
 
 
 # v1.21.0
@@ -38,3 +45,28 @@
     - cluster-autoscaler/cloudprovider/builder/builder_all.go
     - cluster-autoscaler/cloudprovider/mcm
     - cluster-autoscaler/integration
+
+
+# v1.21.1
+
+
+## Synced with which upstream CA
+
+[v1.21.3](https://github.com/kubernetes/autoscaler/tree/cluster-autoscaler-1.21.3/cluster-autoscaler)
+
+## Changes made
+
+### To FAQ
+
+- included new questions and answers
+- included new steps to vendor new MCM version
+
+### During merging
+
+- included new ec2 instances in `cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go`
+
+### During vendoring k8s
+Still vendoring k8s 1.21.0 in this fork but the upstream 1.21.3 is vendoring k8s 1.25.0
+
+### Others
+_None_
diff --git a/cluster-autoscaler/cloudprovider/aws/auto_scaling_test.go b/cluster-autoscaler/cloudprovider/aws/auto_scaling_test.go
@@ -27,22 +27,22 @@ import (
 	"github.com/stretchr/testify/require"
 )
 
-func TestMoreThen50Groups(t *testing.T) {
+func TestMoreThen100Groups(t *testing.T) {
 	service := &AutoScalingMock{}
 	autoScalingWrapper := &autoScalingWrapper{
 		autoScaling: service,
 	}
 
-	// Generate 51 ASG names
-	names := make([]string, 51)
+	// Generate 101 ASG names
+	names := make([]string, 101)
 	for i := 0; i < len(names); i++ {
 		names[i] = fmt.Sprintf("asg-%d", i)
 	}
 
-	// First batch, first 50 elements
+	// First batch, first 100 elements
 	service.On("DescribeAutoScalingGroupsPages",
 		&autoscaling.DescribeAutoScalingGroupsInput{
-			AutoScalingGroupNames: aws.StringSlice(names[:50]),
+			AutoScalingGroupNames: aws.StringSlice(names[:100]),
 			MaxRecords:            aws.Int64(maxRecordsReturnedByAPI),
 		},
 		mock.AnythingOfType("func(*autoscaling.DescribeAutoScalingGroupsOutput, bool) bool"),
@@ -51,10 +51,10 @@ func TestMoreThen50Groups(t *testing.T) {
 		fn(testNamedDescribeAutoScalingGroupsOutput("asg-1", 1, "test-instance-id"), false)
 	}).Return(nil)
 
-	// Second batch, element 51
+	// Second batch, element 101
 	service.On("DescribeAutoScalingGroupsPages",
 		&autoscaling.DescribeAutoScalingGroupsInput{
-			AutoScalingGroupNames: aws.StringSlice([]string{"asg-50"}),
+			AutoScalingGroupNames: aws.StringSlice([]string{"asg-100"}),
 			MaxRecords:            aws.Int64(maxRecordsReturnedByAPI),
 		},
 		mock.AnythingOfType("func(*autoscaling.DescribeAutoScalingGroupsOutput, bool) bool"),

diff --git a/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go b/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go
@@ -362,7 +362,10 @@ func BuildAWS(opts config.AutoscalingOptions, do cloudprovider.NodeGroupDiscover
 
 		generatedInstanceTypes, err := GenerateEC2InstanceTypes(region)
 		if err != nil {
-			klog.Fatalf("Failed to generate AWS EC2 Instance Types: %v", err)
+			klog.Errorf("Failed to generate AWS EC2 Instance Types: %v, falling back to static list with last update time: %s", err, lastUpdateTime)
+		}
+		if generatedInstanceTypes == nil {
+			generatedInstanceTypes = map[string]*InstanceType{}
 		}
 		// fallback on the static list if we miss any instance types in the generated output
 		// credits to: https://github.com/lyft/cni-ipvlan-vpc-k8s/pull/80

diff --git a/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider_test.go b/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider_test.go
@@ -17,6 +17,7 @@ limitations under the License.
 package aws
 
 import (
+	"os"
 	"testing"
 
 	"github.com/aws/aws-sdk-go/aws"
@@ -26,6 +27,7 @@ import (
 	"github.com/stretchr/testify/mock"
 	apiv1 "k8s.io/api/core/v1"
 	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
+	"k8s.io/autoscaler/cluster-autoscaler/config"
 )
 
 type AutoScalingMock struct {
@@ -148,6 +150,23 @@ func TestBuildAwsCloudProvider(t *testing.T) {
 	assert.NoError(t, err)
 }
 
+func TestInstanceTypeFallback(t *testing.T) {
+	resourceLimiter := cloudprovider.NewResourceLimiter(
+		map[string]int64{cloudprovider.ResourceNameCores: 1, cloudprovider.ResourceNameMemory: 10000000},
+		map[string]int64{cloudprovider.ResourceNameCores: 10, cloudprovider.ResourceNameMemory: 100000000})
+
+	do := cloudprovider.NodeGroupDiscoveryOptions{}
+	opts := config.AutoscalingOptions{}
+
+	os.Setenv("AWS_REGION", "non-existent-region")
+	defer os.Unsetenv("AWS_REGION")
+
+	// This test ensures that no klog.Fatalf calls occur when constructing the AWS cloud provider.  Specifically it is
+	// intended to ensure that instance type fallback works correctly in the event of an error enumerating instance
+	// types.
+	_ = BuildAWS(opts, do, resourceLimiter)
+}
+
 func TestName(t *testing.T) {
 	provider := testProvider(t, testAwsManager)
 	assert.Equal(t, provider.Name(), cloudprovider.AwsProviderName)

diff --git a/cluster-autoscaler/cloudprovider/aws/aws_manager.go b/cluster-autoscaler/cloudprovider/aws/aws_manager.go
@@ -49,7 +49,7 @@ const (
 	operationWaitTimeout    = 5 * time.Second
 	operationPollInterval   = 100 * time.Millisecond
 	maxRecordsReturnedByAPI = 100
-	maxAsgNamesPerDescribe  = 50
+	maxAsgNamesPerDescribe  = 100
 	refreshInterval         = 1 * time.Minute
 	autoDiscovererTypeASG   = "asg"
 	asgAutoDiscovererKeyTag = "tag"
@@ -312,7 +312,7 @@ func (m *AwsManager) getAsgTemplate(asg *asg) (*asgTemplate, error) {
 	region := az[0 : len(az)-1]
 
 	if len(asg.AvailabilityZones) > 1 {
-		klog.Warningf("Found multiple availability zones for ASG %q; using %s\n", asg.Name, az)
+		klog.V(4).Infof("Found multiple availability zones for ASG %q; using %s for %s label\n", asg.Name, az, apiv1.LabelFailureDomainBetaZone)
 	}
 
 	instanceTypeName, err := m.buildInstanceType(asg)

diff --git a/cluster-autoscaler/cloudprovider/aws/aws_util.go b/cluster-autoscaler/cloudprovider/aws/aws_util.go
@@ -20,21 +20,26 @@ import (
 	"encoding/json"
 	"errors"
 	"fmt"
-	"github.com/aws/aws-sdk-go/aws/endpoints"
-	"io/ioutil"
-	klog "k8s.io/klog/v2"
+	"io"
 	"net/http"
 	"os"
 	"regexp"
 	"strconv"
 	"strings"
+
+	"github.com/aws/aws-sdk-go/aws"
+	"github.com/aws/aws-sdk-go/aws/ec2metadata"
+	"github.com/aws/aws-sdk-go/aws/endpoints"
+	"github.com/aws/aws-sdk-go/aws/session"
+
+	klog "k8s.io/klog/v2"
 )
 
 var (
-	ec2MetaDataServiceUrl          = "http://169.254.169.254/latest/dynamic/instance-identity/document"
+	ec2MetaDataServiceUrl          = "http://169.254.169.254"
 	ec2PricingServiceUrlTemplate   = "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/%s/index.json"
 	ec2PricingServiceUrlTemplateCN = "https://pricing.cn-north-1.amazonaws.com.cn/offers/v1.0/cn/AmazonEC2/current/%s/index.json"
-	staticListLastUpdateTime       = "2020-12-07"
+	staticListLastUpdateTime       = "2022-06-02"
 )
 
 type response struct {
@@ -82,16 +87,9 @@ func GenerateEC2InstanceTypes(region string) (map[string]*InstanceType, error) {
 
 			defer res.Body.Close()
 
-			body, err := ioutil.ReadAll(res.Body)
+			unmarshalled, err := unmarshalProductsResponse(res.Body)
 			if err != nil {
-				klog.Warningf("Error parsing %s skipping...\n", url)
-				continue
-			}
-
-			var unmarshalled = response{}
-			err = json.Unmarshal(body, &unmarshalled)
-			if err != nil {
-				klog.Warningf("Error unmarshalling %s, skip...\n", url)
+				klog.Warningf("Error parsing %s skipping...\n%s\n", url, err)
 				continue
 			}
 
@@ -127,6 +125,58 @@ func GetStaticEC2InstanceTypes() (map[string]*InstanceType, string) {
 	return InstanceTypes, staticListLastUpdateTime
 }
 
+func unmarshalProductsResponse(r io.Reader) (*response, error) {
+	dec := json.NewDecoder(r)
+	t, err := dec.Token()
+	if err != nil {
+		return nil, err
+	}
+	if delim, ok := t.(json.Delim); !ok || delim.String() != "{" {
+		return nil, errors.New("Invalid products json")
+	}
+
+	unmarshalled := response{map[string]product{}}
+
+	for dec.More() {
+		t, err = dec.Token()
+		if err != nil {
+			return nil, err
+		}
+
+		if t == "products" {
+			tt, err := dec.Token()
+			if err != nil {
+				return nil, err
+			}
+			if delim, ok := tt.(json.Delim); !ok || delim.String() != "{" {
+				return nil, errors.New("Invalid products json")
+			}
+			for dec.More() {
+				productCode, err := dec.Token()
+				if err != nil {
+					return nil, err
+				}
+
+				prod := product{}
+				if err = dec.Decode(&prod); err != nil {
+					return nil, err
+				}
+				unmarshalled.Products[productCode.(string)] = prod
+			}
+		}
+	}
+
+	t, err = dec.Token()
+	if err != nil {
+		return nil, err
+	}
+	if delim, ok := t.(json.Delim); !ok || delim.String() != "}" {
+		return nil, errors.New("Invalid products json")
+	}
+
+	return &unmarshalled, nil
+}
+
 func parseMemory(memory string) int64 {
 	reg, err := regexp.Compile("[^0-9\\.]+")
 	if err != nil {
@@ -155,26 +205,13 @@ func GetCurrentAwsRegion() (string, error) {
 	region, present := os.LookupEnv("AWS_REGION")
 
 	if !present {
-		klog.V(1).Infof("fetching %s\n", ec2MetaDataServiceUrl)
-		res, err := http.Get(ec2MetaDataServiceUrl)
-		if err != nil {
-			return "", fmt.Errorf("Error fetching %s", ec2MetaDataServiceUrl)
-		}
-
-		defer res.Body.Close()
-
-		body, err := ioutil.ReadAll(res.Body)
+		c := aws.NewConfig().
+			WithEndpoint(ec2MetaDataServiceUrl)
+		sess, err := session.NewSession()
 		if err != nil {
-			return "", fmt.Errorf("Error parsing %s", ec2MetaDataServiceUrl)
+			return "", fmt.Errorf("failed to create session")
 		}
-
-		var unmarshalled = map[string]string{}
-		err = json.Unmarshal(body, &unmarshalled)
-		if err != nil {
-			klog.Warningf("Error unmarshalling %s, skip...\n", ec2MetaDataServiceUrl)
-		}
-
-		region = unmarshalled["region"]
+		return ec2metadata.New(sess, c).Region()
 	}
 
 	return region, nil