Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get credentials from role has become slow #2972

Closed
alexd765 opened this issue Nov 25, 2019 · 22 comments · Fixed by #3066 or #3068
Closed

Get credentials from role has become slow #2972

alexd765 opened this issue Nov 25, 2019 · 22 comments · Fixed by #3066 or #3068

Comments

@alexd765
Copy link

Please fill out the sections below to help us address your issue.

Version of AWS SDK for Go?

v1.25.41

Version of Go (go version)?

go version go1.11.1 linux/amd64

What issue did you see?

Getting credentials from an ec2 role takes 20 seconds.

We have this issue on consistently on multiple different ec2 instances since a couple of days ago.

To me it looks like an issue with the service, but I am not sure.

Steps to reproduce

package slow

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func getCredentialsFromRole() (*credentials.Credentials, error) {
	roleProvider := &ec2rolecreds.EC2RoleProvider{
		Client: ec2metadata.New(session.New()),
	}
	creds := credentials.NewCredentials(roleProvider)

	start := time.Now().UTC()
	if _, err := creds.Get(); err != nil { // this takes 20 seconds
		return nil, err
	}
	fmt.Printf("getting credentails from role took %s\n", time.Now().UTC().Sub(start))

	return creds, nil
}
@diehlaws diehlaws self-assigned this Nov 25, 2019
@diehlaws
Copy link
Contributor

Thanks for reaching out to us about this @alexd765. The behavior you describe sounds like it is taking a long time for the EC2 metadata service on your instance(s) to respond to a request for instance profile credentials. Do you see this behavior only on specific instance types/sizes? (If so, which ones?) Or does this occur consistently regardless of the instance type/size being used? Do you see a similar delay when attempting to reach the instance metadata service in a manner that does not involve the SDK (e.g. time curl http://169.254.169.254/latest/)?

@diehlaws diehlaws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Nov 25, 2019
@alexd765
Copy link
Author

alexd765 commented Nov 26, 2019

I see it on t2.micro, t2.small, c5.4xlarge.

We are in eu-west-1.

curl http://169.254.169.254/latest/ is instant on the affected machines.

Interestingly it is only getting the credentials from role that is slow. Getting credentials from environment vars is also instant.

@nlundbo
Copy link

nlundbo commented Nov 26, 2019

We also saw similar behavior, we noticed issues on the operations

  • S3 -> ListBuckets
  • ec2MetadataService -> Region()
    both took 20+ seconds to complete. We noticed that going back to the previous version (for us it was v1.25.27) it was quick again. We saw the issue both in 1.25.40 and 1.25.41.

The above curl is also super-fast for us (only tested on t2.small)

@jasdel
Copy link
Contributor

jasdel commented Nov 26, 2019

Thanks for raising this issue, @alexd765, @nlundbo. Is your application running in a Kubernetes pod, Docker container, or using any IP forwarding/proxy?

If so you may be impacted by a recent security change to EC2 instances, where the Instance Metadata service will limit the number of hops a request can make. Increasing hop limit on the instance should take care of long timeout issues on EC2 for customers who use any IP forwarding/proxying.

The EC2 ModifyInstanceMetadataOptions operation can be used to update the hop limit needed for your application's use case.

See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html for more information on EC2's IMDS update.

@nlundbo
Copy link

nlundbo commented Nov 28, 2019

Yes, my use case was code inside a docker-container. I tested changing the number hops - and it resolved the issue. I read somewhere that it should not affect deployments in ECS - I have not verified this yet.

@pmalekn
Copy link

pmalekn commented Dec 2, 2019

I'm also seeing this issue (I believe it's the same) and I've been able to pin point v1.25.38 as the culprit release.

Before v1.25.38 I can see the following requests being made by the sdk (when running on a docker container on EC2):

DEBU[2019-12-02T15:53:48.399Z] aws-sdk-message="[DEBUG: Request ec2metadata/GetMetadata Details:\n---[ REQUEST POST-SIGN ]-----------------------------\nGET /latest/meta-data/iam/security-credentials/ HTTP/1.1\r\nHost: 169.254.169.254\r\nUser-Agent: aws-sdk-go/1.25.37 (go1.13.4; linux; amd64)\r\nAccept-Encoding: gzip\r\n\r\n\n-----------------------------------------------------]"
DEBU[2019-12-02T15:53:48.4Z] aws-sdk-message="[DEBUG: Response ec2metadata/GetMetadata Details:\n---[ RESPONSE ]--------------------------------------\nHTTP/1.1 200 OK\r\nConnection: close\r\nContent-Length: 38\r\nAccept-Ranges: none\r\nContent-Type: text/plain\r\nDate: Mon, 02 Dec 2019 15:53:48 GMT\r\nLast-Modified: Mon, 02 Dec 2019 14:55:48 GMT\r\nServer: EC2ws\r\n\r\n\n-----------------------------------------------------]"

while on v1.25.38 and newer:

DEBU[2019-12-02T16:00:54.872Z] aws-sdk-message="[DEBUG: Request ec2metadata/GetToken Details:\n---[ REQUEST POST-SIGN ]-----------------------------\nPUT /latest/api/token HTTP/1.1\r\nHost: 169.254.169.254\r\nUser-Agent: aws-sdk-go/1.25.38 (go1.13.4; linux; amd64)\r\nContent-Length: 0\r\nX-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600\r\nAccept-Encoding: gzip\r\n\r\n\n-----------------------------------------------------]"
DEBU[2019-12-02T16:00:59.873Z] aws-sdk-message="[DEBUG: Send Request ec2metadata/GetToken failed, attempt 0/3, error RequestError: send request failed\ncaused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled (Client.Timeout exceeded while awaiting headers)]"
DEBU[2019-12-02T16:00:59.914Z] aws-sdk-message="[DEBUG: Request ec2metadata/GetToken Details:\n---[ REQUEST POST-SIGN ]-----------------------------\nPUT /latest/api/token HTTP/1.1\r\nHost: 169.254.169.254\r\nUser-Agent: aws-sdk-go/1.25.38 (go1.13.4; linux; amd64)\r\nContent-Length: 0\r\nX-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600\r\nAccept-Encoding: gzip\r\n\r\n\n-----------------------------------------------------]"
DEBU[2019-12-02T16:01:04.914Z] aws-sdk-message="[DEBUG: Send Request ec2metadata/GetToken failed, attempt 1/3, error RequestError: send request failed\ncaused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled (Client.Timeout exceeded while awaiting headers)]"
DEBU[2019-12-02T16:01:05.009Z] aws-sdk-message="[DEBUG: Request ec2metadata/GetToken Details:\n---[ REQUEST POST-SIGN ]-----------------------------\nPUT /latest/api/token HTTP/1.1\r\nHost: 169.254.169.254\r\nUser-Agent: aws-sdk-go/1.25.38 (go1.13.4; linux; amd64)\r\nContent-Length: 0\r\nX-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600\r\nAccept-Encoding: gzip\r\n\r\n\n-----------------------------------------------------]"
DEBU[2019-12-02T16:01:10.009Z] aws-sdk-message="[DEBUG: Send Request ec2metadata/GetToken failed, attempt 2/3, error RequestError: send request failed\ncaused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled (Client.Timeout exceeded while awaiting headers)]"
DEBU[2019-12-02T16:01:10.191Z] aws-sdk-message="[DEBUG: Request ec2metadata/GetToken Details:\n---[ REQUEST POST-SIGN ]-----------------------------\nPUT /latest/api/token HTTP/1.1\r\nHost: 169.254.169.254\r\nUser-Agent: aws-sdk-go/1.25.38 (go1.13.4; linux; amd64)\r\nContent-Length: 0\r\nX-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600\r\nAccept-Encoding: gzip\r\n\r\n\n-----------------------------------------------------]"
DEBU[2019-12-02T16:01:15.191Z] aws-sdk-message="[DEBUG: Send Request ec2metadata/GetToken failed, attempt 3/3, error RequestError: send request failed\ncaused by: Put http://169.254.169.254/latest/api/token: net/http: request canceled (Client.Timeout exceeded while awaiting headers)]"
// Fallback to "/latest/meta-data/iam/security-credentials/"
DEBU[2019-12-02T16:01:15.191Z] aws-sdk-message="[DEBUG: Request ec2metadata/GetMetadata Details:\n---[ REQUEST POST-SIGN ]-----------------------------\nGET /latest/meta-data/iam/security-credentials/ HTTP/1.1\r\nHost: 169.254.169.254\r\nUser-Agent: aws-sdk-go/1.25.38 (go1.13.4; linux; amd64)\r\nAccept-Encoding: gzip\r\n\r\n\n-----------------------------------------------------]"
DEBU[2019-12-02T16:01:15.192Z] aws-sdk-message="[DEBUG: Response ec2metadata/GetMetadata Details:\n---[ RESPONSE ]--------------------------------------\nHTTP/1.1 200 OK\r\nConnection: close\r\nContent-Length: 38\r\nAccept-Ranges: none\r\nContent-Type: text/plain\r\nDate: Mon, 02 Dec 2019 16:01:15 GMT\r\nLast-Modified: Mon, 02 Dec 2019 15:55:34 GMT\r\nServer: EC2ws\r\n\r\n\n-----------------------------------------------------]"

I run this app without specifying AWS env keys nor with a shared profile.

@tyrken
Copy link

tyrken commented Dec 2, 2019

@pmalekn - They introduced the new EC2 metadata feature in v1.25.38 - and a backwards compatibility issue by setting the default hop limit to 1 - which means the replies get dropped while transiting the docker bridge network.

You have to increase the hop limit yourself or it just hit lengthy timeouts & retries before eventual fallback to the old method

@pmalekn
Copy link

pmalekn commented Dec 3, 2019

@tyrken Ok... so you're probably referring to 7814a7f#diff-68798fbeff74277f6e5c5ab6f1cf92c6R70454-R70460

Adding a snippet since the file is too big to show when linked service/ec2/api.go

// The metadata options for the instance.
type InstanceMetadataOptionsRequest struct {
	_ struct{} `type:"structure"`

	// This parameter enables or disables the HTTP metadata endpoint on your instances.
	// If the parameter is not specified, the default state is enabled.
	//
	// If you specify a value of disabled, you will not be able to access your instance
	// metadata.
	HttpEndpoint *string `type:"string" enum:"InstanceMetadataEndpointState"`

	// The desired HTTP PUT response hop limit for instance metadata requests. The
	// larger the number, the further instance metadata requests can travel.
	//
	// Default: 1
	//
	// Possible values: Integers from 1 to 64
	HttpPutResponseHopLimit *int64 `type:"integer"`

It then makes sense that it cannot reach the endpoint if I'm calling it from inside a docker container (because there's one more hop through docker bridge right?) or is my reasoning flawed?

How can I change this "globally" then? Or do I have to do it per request basis (that doesn't sound feasible)

@tyrken
Copy link

tyrken commented Dec 3, 2019

No global option I know of, though hoping AWS might see sense & implement something like #2980

In the mean time you can only stay on old version of aws-sdk or update all EC2 instances to increase the hop limit, e.g. with:

aws ec2 modify-instance-metadata-options --instance-id i-34215432543254235 --http-endpoint enabled --http-put-response-hop-limit 2

@alexd765
Copy link
Author

alexd765 commented Dec 3, 2019

Our problem is the same as described above.

aws ec2 modify-instance-metadata-options --instance-id i-34215432543254235 --http-endpoint enabled --http-put-response-hop-limit 2
is a temporary solution.

I really hope their will be a better solution in the future.
Running Go with this sdk in a docker container on an EC2 instance is one of the most common use cases of this sdk that I can think of. This really should just work out of the box.

skotambkar added a commit that referenced this issue Jan 7, 2020
…etadata client (#3066)

The PR address the issues related to EC2Metadata client having long timeouts in case of failure in obtaining EC2Metadata token, while making a request to IMDS.

The PR reduces the timeout to 1 sec, and number of max retries to 2 for the EC2Metadata client. This would help reduce the long timeouts faced by the customers.

Fixes #2972
aws-sdk-go-automation pushed a commit that referenced this issue Jan 7, 2020
===

### Service Client Updates
* `service/AWSMigrationHub`: Updates service API, documentation, and paginators
* `service/codebuild`: Updates service API and documentation
  * Add encryption key override to StartBuild API in AWS CodeBuild.
* `service/xray`: Updates service documentation
  * Documentation updates for xray

### SDK Enhancements
* `aws`: Add configuration option enable the SDK to unmarshal API response header maps to normalized lower case map keys. ([#3033](#3033))
  * Setting `aws.Config.LowerCaseHeaderMaps` to `true` will result in S3's X-Amz-Meta prefixed header to be unmarshaled to lower case Metadata member's map keys.

### SDK Bugs
* `aws/ec2metadata` : Reduces request timeout for EC2Metadata client along with maximum number of retries ([#3066](#3066))
  * Reduces latency while fetching response from EC2Metadata client running in a container to around 3 seconds
  * Fixes [#2972](#2972)
aws-sdk-go-automation added a commit that referenced this issue Jan 7, 2020
Release v1.27.2 (2020-01-07)
===

### Service Client Updates
* `service/AWSMigrationHub`: Updates service API, documentation, and paginators
* `service/codebuild`: Updates service API and documentation
  * Add encryption key override to StartBuild API in AWS CodeBuild.
* `service/xray`: Updates service documentation
  * Documentation updates for xray

### SDK Enhancements
* `aws`: Add configuration option enable the SDK to unmarshal API response header maps to normalized lower case map keys. ([#3033](#3033))
  * Setting `aws.Config.LowerCaseHeaderMaps` to `true` will result in S3's X-Amz-Meta prefixed header to be unmarshaled to lower case Metadata member's map keys.

### SDK Bugs
* `aws/ec2metadata` : Reduces request timeout for EC2Metadata client along with maximum number of retries ([#3066](#3066))
  * Reduces latency while fetching response from EC2Metadata client running in a container to around 3 seconds
  * Fixes [#2972](#2972)
@caleb15
Copy link

caleb15 commented Dec 18, 2020

@alexd765 's solution worked for me - thanks Alex!

You can implement his solution in terraform like so:

resource "aws_instance" "your_ec2_instance_name" {
  params_go_here                     = var.blablabla
  
  metadata_options {
    # So docker can access ec2 metadata
    # see https://github.com/aws/aws-sdk-go/issues/2972
    http_put_response_hop_limit = 2
  }
}

Also note that it's not just go that runs into this problem - even just using curl as described in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html results in a long delay.

@julvo
Copy link

julvo commented Jan 11, 2021

Ran into the same problem using the AWS go SDK in an Elastic Beanstalk (EB) Docker environment, but didn't see a way to set http_put_response_hop_limit in EB without manipulating the EC2 instance directly.

Fixed (or rather sidestepped) it by setting AWS user credentials and default region as environment variables - hope it helps others in the same situation.

@xubofei1983
Copy link

We get same problem and walk-around works.

lfittl added a commit to pganalyze/collector that referenced this issue Mar 9, 2021
Previously we were re-using our shared HTTP client, which has a rather
high timeout (120 seconds) that causes the HTTP client to wait around
for a long time. This is generally intentional (since it includes the
time spent downloading a request body), but is a bad idea when running
into EC2's IDMSv2 service that has a network-hop based limit. If that
hop limit is exceeded, the requests just go to nowhere, causing the
client to wait for a multiple of 120 seconds (~10 minutes were observed).

This instead uses a special client for the EC2 instance metadata service
that has a much lower timeout (1 second, like in the AWS SDK itself), to
avoid the problem.

See also aws/aws-sdk-go#2972
lfittl added a commit to pganalyze/collector that referenced this issue Mar 10, 2021
Previously we were re-using our shared HTTP client, which has a rather
high timeout (120 seconds) that causes the HTTP client to wait around
for a long time. This is generally intentional (since it includes the
time spent downloading a request body), but is a bad idea when running
into EC2's IDMSv2 service that has a network-hop based limit. If that
hop limit is exceeded, the requests just go to nowhere, causing the
client to wait for a multiple of 120 seconds (~10 minutes were observed).

This instead uses a special client for the EC2 instance metadata service
that has a much lower timeout (1 second, like in the AWS SDK itself), to
avoid the problem.

See also aws/aws-sdk-go#2972
@dulithag
Copy link

Hi guys I am wondering is this same issue exist for go-sdk-v2 ?

@skmcgrail skmcgrail added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Mar 19, 2021
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Mar 20, 2021
@caleb15
Copy link

caleb15 commented Aug 23, 2021

Can we remove the stale lifecycle on this one please @skmcgrail ? I'm presuming it's still an issue and the reason people aren't commenting on it is because of the workaround. A lot of people have ran into this problem and it would be good to have explicit confirmation that it is fixed when it gets fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment