Skip to content

Commit

Permalink
Sync with upstream v1.21.3 (#129)
Browse files Browse the repository at this point in the history
* Set maxAsgNamesPerDescribe to the new maximum value

While this was previously effectively limited to 50, `DescribeAutoScalingGroups` now supports
fetching 100 ASG per calls on all regions, matching what's documented:
https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_DescribeAutoScalingGroups.html
```
     AutoScalingGroupNames.member.N
       The names of the Auto Scaling groups.
       By default, you can only specify up to 50 names.
       You can optionally increase this limit using the MaxRecords parameter.
     MaxRecords
       The maximum number of items to return with this call.
       The default value is 50 and the maximum value is 100.
```

Doubling this halves API calls on large clusters, which should help to prevent throttling.

* Break out unmarshal from GenerateEC2InstanceTypes

Refactor to allow for optimisation

* Optimise GenerateEC2InstanceTypes unmarshal memory usage

The pricing json for us-east-1 is currently 129MB. Currently fetching
this into memory and parsing results in a large memory footprint on
startup, and can lead to the autoscaler being OOMKilled.

Change the ReadAll/Unmarshal logic to a stream decoder to significantly
reduce the memory use.

* use aws sdk to find region

* Merge pull request kubernetes#4274 from kinvolk/imran/cloud-provider-packet-fix

Cloud provider[Packet] fixes

* Fix templated nodeinfo names collisions in BinpackingNodeEstimator

Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now uses
the same shared DeepCopyTemplateNode function and inherits its naming
pattern, which is great as that fixes a long standing bug.

Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with
generated nodeinfos and nodes having predictable names (using template name
+ an incremental ordinal starting at 0) for upcoming nodes.

Later, when it looks for fitting nodes for unschedulable pods (when upcoming
nodes don't satisfy those (FitsAnyNodeMatching failing due to nodes capacity,
or pods antiaffinity, ...), the binpacking estimator will also build virtual
nodes and place them in a snapshot fork to evaluate scheduler predicates.

Those temporary virtual nodes are built using the same pattern (template name
and an index ordinal also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same nodeinfos/nodes
names for nodegroups having upcoming nodes.

But adding nodes by the same name in an existing cluster snapshot isn't
allowed, and the evaluation attempt will fail.

Practically this blocks re-upscales for nodegroups having upcoming nodes,
which can cause a significant delay.

* Improve misleading log

Signed-off-by: Sylvain Rabot <[email protected]>

* dont proactively decrement azure cache for unregistered nodes

* annotate fakeNodes so that cloudprovider implementations can identify them if needed

* move annotations to cloudprovider package

* Cluster Autoscaler 1.21.1

* CA - AWS - Instance List Update 03-10-21 - 1.21 release branch

* CA - AWS - Instance List Update 29-10-21 - 1.21 release branch

* Cluster-Autoscaler update AWS EC2 instance types with g5, m6 and r6

* CA - AWS Instance List Update - 13/12/21 - 1.21

* Merge pull request kubernetes#4497 from marwanad/add-more-azure-instance-types

add more azure instance types

* Cluster Autoscaler 1.21.2

* Add `--feature-gates` flag to support scale up on volume limits (CSI migration enabled)

Signed-off-by: ialidzhikov <[email protected]>

* [Cherry pick 1.21] Remove TestDeleteBlob UT

Signed-off-by: Zhecheng Li <[email protected]>

* cherry-pick kubernetes#4022 [cluster-autoscaler] Publish node group min/max metrics

* Skipping metrics tests added in kubernetes#4022

Each test works in isolation, but they cause panic when the entire
suite is run (ex. make test-in-docker), because the underlying
metrics library panics when the same metric is registered twice.

(cherry picked from commit 52392b3)

* cherry-pick kubernetes#4162 and kubernetes#4172 [cluster-autoscaler]Add flag to control DaemonSet eviction on non-empty nodes & Allow DaemonSet pods to opt in/out
from eviction.

* CA - AWS Cloud Provider - 1.21 Static Instance List Update 02-06-2022

* fix instance type fallback

Instead of logging a fatal error, log a standard error and fall back to
loading instance types from the static list.

* Cluster Autoscaler - 1.21.3 release

* FAQ updated

* Sync_changes file updated

Co-authored-by: Benjamin Pineau <[email protected]>
Co-authored-by: Adrian Lai <[email protected]>
Co-authored-by: darkpssngr <[email protected]>
Co-authored-by: Kubernetes Prow Robot <[email protected]>
Co-authored-by: Sylvain Rabot <[email protected]>
Co-authored-by: Marwan Ahmed <[email protected]>
Co-authored-by: Jakub Tużnik <[email protected]>
Co-authored-by: GuyTempleton <[email protected]>
Co-authored-by: sturman <[email protected]>
Co-authored-by: Maciek Pytel <[email protected]>
Co-authored-by: ialidzhikov <[email protected]>
Co-authored-by: Zhecheng Li <[email protected]>
Co-authored-by: Shubham Kuchhal <[email protected]>
Co-authored-by: Todd Neal <[email protected]>
  • Loading branch information
15 people authored Jun 25, 2022
1 parent f74568e commit 1b0b9f5
Show file tree
Hide file tree
Showing 32 changed files with 3,545 additions and 472 deletions.
486 changes: 283 additions & 203 deletions cluster-autoscaler/FAQ.md

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions cluster-autoscaler/SYNC-CHANGES/SYNC-CHANGES-1.21.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,13 @@
- [During merging](#during-merging)
- [During vendoring k8s](#during-vendoring-k8s)
- [Others](#others)
- [v1.21.1](#v1211)
- [Synced with which upstream CA](#synced-with-which-upstream-ca-1)
- [Changes made](#changes-made-1)
- [To FAQ](#to-faq-1)
- [During merging](#during-merging-1)
- [During vendoring k8s](#during-vendoring-k8s-1)
- [Others](#others-1)


# v1.21.0
Expand Down Expand Up @@ -38,3 +45,28 @@
- cluster-autoscaler/cloudprovider/builder/builder_all.go
- cluster-autoscaler/cloudprovider/mcm
- cluster-autoscaler/integration


# v1.21.1


## Synced with which upstream CA

[v1.21.3](https://github.com/kubernetes/autoscaler/tree/cluster-autoscaler-1.21.3/cluster-autoscaler)

## Changes made

### To FAQ

- included new questions and answers
- included new steps to vendor new MCM version

### During merging

- included new ec2 instances in `cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go`

### During vendoring k8s
Still vendoring k8s 1.21.0 in this fork but the upstream 1.21.3 is vendoring k8s 1.25.0

### Others
_None_
14 changes: 7 additions & 7 deletions cluster-autoscaler/cloudprovider/aws/auto_scaling_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,22 +27,22 @@ import (
"github.com/stretchr/testify/require"
)

func TestMoreThen50Groups(t *testing.T) {
func TestMoreThen100Groups(t *testing.T) {
service := &AutoScalingMock{}
autoScalingWrapper := &autoScalingWrapper{
autoScaling: service,
}

// Generate 51 ASG names
names := make([]string, 51)
// Generate 101 ASG names
names := make([]string, 101)
for i := 0; i < len(names); i++ {
names[i] = fmt.Sprintf("asg-%d", i)
}

// First batch, first 50 elements
// First batch, first 100 elements
service.On("DescribeAutoScalingGroupsPages",
&autoscaling.DescribeAutoScalingGroupsInput{
AutoScalingGroupNames: aws.StringSlice(names[:50]),
AutoScalingGroupNames: aws.StringSlice(names[:100]),
MaxRecords: aws.Int64(maxRecordsReturnedByAPI),
},
mock.AnythingOfType("func(*autoscaling.DescribeAutoScalingGroupsOutput, bool) bool"),
Expand All @@ -51,10 +51,10 @@ func TestMoreThen50Groups(t *testing.T) {
fn(testNamedDescribeAutoScalingGroupsOutput("asg-1", 1, "test-instance-id"), false)
}).Return(nil)

// Second batch, element 51
// Second batch, element 101
service.On("DescribeAutoScalingGroupsPages",
&autoscaling.DescribeAutoScalingGroupsInput{
AutoScalingGroupNames: aws.StringSlice([]string{"asg-50"}),
AutoScalingGroupNames: aws.StringSlice([]string{"asg-100"}),
MaxRecords: aws.Int64(maxRecordsReturnedByAPI),
},
mock.AnythingOfType("func(*autoscaling.DescribeAutoScalingGroupsOutput, bool) bool"),
Expand Down
5 changes: 4 additions & 1 deletion cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go
Original file line number Diff line number Diff line change
Expand Up @@ -362,7 +362,10 @@ func BuildAWS(opts config.AutoscalingOptions, do cloudprovider.NodeGroupDiscover

generatedInstanceTypes, err := GenerateEC2InstanceTypes(region)
if err != nil {
klog.Fatalf("Failed to generate AWS EC2 Instance Types: %v", err)
klog.Errorf("Failed to generate AWS EC2 Instance Types: %v, falling back to static list with last update time: %s", err, lastUpdateTime)
}
if generatedInstanceTypes == nil {
generatedInstanceTypes = map[string]*InstanceType{}
}
// fallback on the static list if we miss any instance types in the generated output
// credits to: https://github.com/lyft/cni-ipvlan-vpc-k8s/pull/80
Expand Down
19 changes: 19 additions & 0 deletions cluster-autoscaler/cloudprovider/aws/aws_cloud_provider_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ limitations under the License.
package aws

import (
"os"
"testing"

"github.com/aws/aws-sdk-go/aws"
Expand All @@ -26,6 +27,7 @@ import (
"github.com/stretchr/testify/mock"
apiv1 "k8s.io/api/core/v1"
"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
"k8s.io/autoscaler/cluster-autoscaler/config"
)

type AutoScalingMock struct {
Expand Down Expand Up @@ -148,6 +150,23 @@ func TestBuildAwsCloudProvider(t *testing.T) {
assert.NoError(t, err)
}

func TestInstanceTypeFallback(t *testing.T) {
resourceLimiter := cloudprovider.NewResourceLimiter(
map[string]int64{cloudprovider.ResourceNameCores: 1, cloudprovider.ResourceNameMemory: 10000000},
map[string]int64{cloudprovider.ResourceNameCores: 10, cloudprovider.ResourceNameMemory: 100000000})

do := cloudprovider.NodeGroupDiscoveryOptions{}
opts := config.AutoscalingOptions{}

os.Setenv("AWS_REGION", "non-existent-region")
defer os.Unsetenv("AWS_REGION")

// This test ensures that no klog.Fatalf calls occur when constructing the AWS cloud provider. Specifically it is
// intended to ensure that instance type fallback works correctly in the event of an error enumerating instance
// types.
_ = BuildAWS(opts, do, resourceLimiter)
}

func TestName(t *testing.T) {
provider := testProvider(t, testAwsManager)
assert.Equal(t, provider.Name(), cloudprovider.AwsProviderName)
Expand Down
4 changes: 2 additions & 2 deletions cluster-autoscaler/cloudprovider/aws/aws_manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ const (
operationWaitTimeout = 5 * time.Second
operationPollInterval = 100 * time.Millisecond
maxRecordsReturnedByAPI = 100
maxAsgNamesPerDescribe = 50
maxAsgNamesPerDescribe = 100
refreshInterval = 1 * time.Minute
autoDiscovererTypeASG = "asg"
asgAutoDiscovererKeyTag = "tag"
Expand Down Expand Up @@ -312,7 +312,7 @@ func (m *AwsManager) getAsgTemplate(asg *asg) (*asgTemplate, error) {
region := az[0 : len(az)-1]

if len(asg.AvailabilityZones) > 1 {
klog.Warningf("Found multiple availability zones for ASG %q; using %s\n", asg.Name, az)
klog.V(4).Infof("Found multiple availability zones for ASG %q; using %s for %s label\n", asg.Name, az, apiv1.LabelFailureDomainBetaZone)
}

instanceTypeName, err := m.buildInstanceType(asg)
Expand Down
101 changes: 69 additions & 32 deletions cluster-autoscaler/cloudprovider/aws/aws_util.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,21 +20,26 @@ import (
"encoding/json"
"errors"
"fmt"
"github.com/aws/aws-sdk-go/aws/endpoints"
"io/ioutil"
klog "k8s.io/klog/v2"
"io"
"net/http"
"os"
"regexp"
"strconv"
"strings"

"github.com/aws/aws-sdk-go/aws"
"github.com/aws/aws-sdk-go/aws/ec2metadata"
"github.com/aws/aws-sdk-go/aws/endpoints"
"github.com/aws/aws-sdk-go/aws/session"

klog "k8s.io/klog/v2"
)

var (
ec2MetaDataServiceUrl = "http://169.254.169.254/latest/dynamic/instance-identity/document"
ec2MetaDataServiceUrl = "http://169.254.169.254"
ec2PricingServiceUrlTemplate = "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/%s/index.json"
ec2PricingServiceUrlTemplateCN = "https://pricing.cn-north-1.amazonaws.com.cn/offers/v1.0/cn/AmazonEC2/current/%s/index.json"
staticListLastUpdateTime = "2020-12-07"
staticListLastUpdateTime = "2022-06-02"
)

type response struct {
Expand Down Expand Up @@ -82,16 +87,9 @@ func GenerateEC2InstanceTypes(region string) (map[string]*InstanceType, error) {

defer res.Body.Close()

body, err := ioutil.ReadAll(res.Body)
unmarshalled, err := unmarshalProductsResponse(res.Body)
if err != nil {
klog.Warningf("Error parsing %s skipping...\n", url)
continue
}

var unmarshalled = response{}
err = json.Unmarshal(body, &unmarshalled)
if err != nil {
klog.Warningf("Error unmarshalling %s, skip...\n", url)
klog.Warningf("Error parsing %s skipping...\n%s\n", url, err)
continue
}

Expand Down Expand Up @@ -127,6 +125,58 @@ func GetStaticEC2InstanceTypes() (map[string]*InstanceType, string) {
return InstanceTypes, staticListLastUpdateTime
}

func unmarshalProductsResponse(r io.Reader) (*response, error) {
dec := json.NewDecoder(r)
t, err := dec.Token()
if err != nil {
return nil, err
}
if delim, ok := t.(json.Delim); !ok || delim.String() != "{" {
return nil, errors.New("Invalid products json")
}

unmarshalled := response{map[string]product{}}

for dec.More() {
t, err = dec.Token()
if err != nil {
return nil, err
}

if t == "products" {
tt, err := dec.Token()
if err != nil {
return nil, err
}
if delim, ok := tt.(json.Delim); !ok || delim.String() != "{" {
return nil, errors.New("Invalid products json")
}
for dec.More() {
productCode, err := dec.Token()
if err != nil {
return nil, err
}

prod := product{}
if err = dec.Decode(&prod); err != nil {
return nil, err
}
unmarshalled.Products[productCode.(string)] = prod
}
}
}

t, err = dec.Token()
if err != nil {
return nil, err
}
if delim, ok := t.(json.Delim); !ok || delim.String() != "}" {
return nil, errors.New("Invalid products json")
}

return &unmarshalled, nil
}

func parseMemory(memory string) int64 {
reg, err := regexp.Compile("[^0-9\\.]+")
if err != nil {
Expand Down Expand Up @@ -155,26 +205,13 @@ func GetCurrentAwsRegion() (string, error) {
region, present := os.LookupEnv("AWS_REGION")

if !present {
klog.V(1).Infof("fetching %s\n", ec2MetaDataServiceUrl)
res, err := http.Get(ec2MetaDataServiceUrl)
if err != nil {
return "", fmt.Errorf("Error fetching %s", ec2MetaDataServiceUrl)
}

defer res.Body.Close()

body, err := ioutil.ReadAll(res.Body)
c := aws.NewConfig().
WithEndpoint(ec2MetaDataServiceUrl)
sess, err := session.NewSession()
if err != nil {
return "", fmt.Errorf("Error parsing %s", ec2MetaDataServiceUrl)
return "", fmt.Errorf("failed to create session")
}

var unmarshalled = map[string]string{}
err = json.Unmarshal(body, &unmarshalled)
if err != nil {
klog.Warningf("Error unmarshalling %s, skip...\n", ec2MetaDataServiceUrl)
}

region = unmarshalled["region"]
return ec2metadata.New(sess, c).Region()
}

return region, nil
Expand Down
Loading

0 comments on commit 1b0b9f5

Please sign in to comment.