
Get Instance Metadata Call Has Become Very Slow After Moving from IMDSv1 to IMDSv2 on RHEL/SELinux instance #3864

Closed
kirstenkissmeyer opened this issue Apr 14, 2021 · 5 comments

kirstenkissmeyer commented Apr 14, 2021

This issue was experienced on a client m5.large instance with RHEL/SELinux, configured via CloudFormation. The slowness introduced when we upgraded from IMDSv1 to IMDSv2 was observed across several working sessions focused on merging, validating, and finalizing a three-phase CloudFormation (CFN) deployment for this instance.
The CFN template step that gets instance metadata went from running in under a second to approximately 50 seconds each time.

We had switched from IMDSv1 to IMDSv2 a week earlier, and the slowness started right after the switch.
It is not specific to the Go SDK; it also occurs with the AWS CLI.

I am posting here initially because I found the similar issue #2972 in this repo, which is also not specific to the Go SDK and can occur with the same CLI call to get instance metadata:
#2972

A sample CLI command to get instance metadata is:
[ec2-user ~]$ TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/

This is from the page describing IMDSv1 vs IMDSv2:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html

A similar get-instance-metadata CLI call on the client EC2 instance eventually returns the requested info without triggering HTTP or other errors, except for timeout warnings. This is very similar to the behavior described in issue #2972.

But there is another factor in the client environment: the AMI used for the instance is a client AMI (approved by their security team) that has SELinux installed, on an m5.large instance with RHEL.
The client is not sure whether any actual SELinux policies are configured in the AMI (I will attempt to get this info and post back here over the next few days). The SELinux mode of the AMI was "enforcing" on initial startup and for the first phase of the CFN configuration steps.
"Enforcing" mode means that any defined SELinux policies are enforced.
The CFN template runs in three phases (sets of steps), with a reboot between each phase.
Prior to the reboot at the end of the first phase, the SELinux mode was changed to "permissive" (meaning no SELinux policies would be enforced, even if some did exist).
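For reference, here is a minimal sketch of the RHEL commands for checking and changing the SELinux mode, plus a way to look for denials around the time of the slow metadata calls. This is my own illustration rather than part of the client's CFN scripts, and it assumes the audit daemon is running:

# Check the current SELinux mode: prints Enforcing, Permissive, or Disabled
getenforce

# Switch to permissive until the next reboot
sudo setenforce 0

# Persist the change across reboots
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

# Look for recent SELinux denials (AVC records) that might involve curl or the network
sudo ausearch -m avc -ts recent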

Interestingly, upon reboot and during the second phase of CFN template configuration steps, the slowness went away and the curl get-instance-metadata calls again ran in under a second. This is why the cause/resolution of issue #2972 does not seem to be the same as in this case: the #2972 cause was IMDSv2 changing the default PUT response hop limit to 1, and the fix was to increase it to a higher number such as 3.
It seems that simply no longer enforcing whatever SELinux policies were defined solved the problem.
It is unknown whether something else in the CFN phase 1 steps prior to the first reboot also increased the hop limit, or perhaps switched back to using IMDSv1 instead of IMDSv2. Again, we will see if we can verify these other aspects.
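As a sketch of how those other aspects could be verified (the instance ID below is a placeholder, not the client's), the instance's current metadata options, including the hop limit and whether IMDSv2 tokens are required, can be read and adjusted with the AWS CLI:

# Inspect the hop limit and token (IMDSv2) settings; i-0123456789abcdef0 is a placeholder ID
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'

# If HttpPutResponseHopLimit is 1, raising it is the fix that resolved issue #2972
aws ec2 modify-instance-metadata-options --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 3 --http-endpoint enabled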

I am wondering if anyone else has come across SELinux combined with the IMDSv2 upgrade causing slowness in SDK/CLI calls to get instance metadata, and I am hoping this issue can help identify any specific policies that may cause it.

Describe the bug
The get-instance-metadata CLI call slows way down (going from running in under 1 second to about 50 seconds) after upgrading from IMDSv1 to IMDSv2 on a RHEL/SELinux m5.large instance.

Version of CLI:
We install and use the latest AWS CLI version on the EC2 instance where the behavior is observed.

To Reproduce (observed behavior)
Steps to reproduce the behavior (please share code or minimal repo)
m5.large with a RHEL and SELinux AMI, configured for Oracle.

Note: I will try to get more info on the policies configured for SELinux. I will collect it over the coming days and update this issue.

I am posting in advance to see if someone has encountered something similar: SELinux slowing SDK/CLI get-instance-metadata calls with IMDSv2 (and not IMDSv1).

Expected behavior
Without yet knowing the exact nature of any SELinux policies that may be in place, and how they may interact with the IMDS changes introduced in v2, it is hard to say specifically what the expected behavior should be.

It would definitely be nice to get some SDK/CLI detection messages or warnings about defined SELinux policies that may interfere with SDK/CLI communications.

Will fill in more as I get more info ...

Additional context
Hoping any info gathered via this issue helps others identify why things may have slowed down dramatically on an instance with SELinux that has just been upgraded to IMDSv2.
In our CFN case, the call to get instance metadata occurred over 60 times to fetch different properties for our CFN phase 1 scripts, so it slowed the process by roughly an hour for just the first phase, making it extremely difficult to debug the CFN scripts themselves and iterate quickly.
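As a workaround sketch (not a fix for the underlying slowdown), fetching the IMDSv2 token once and reusing it for every metadata lookup means the slow token PUT is paid at most once per phase; the helper function and variable names here are illustrative only:

# Fetch one IMDSv2 token (valid for up to 6 hours here) and reuse it for all lookups
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Illustrative helper: imds_get meta-data/instance-id, imds_get meta-data/placement/availability-zone, ...
imds_get() {
  curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" "http://169.254.169.254/latest/$1"
}

INSTANCE_ID=$(imds_get meta-data/instance-id)
AZ=$(imds_get meta-data/placement/availability-zone)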

@kirstenkissmeyer added the bug and needs-triage labels Apr 14, 2021
@kirstenkissmeyer changed the title to "Get Instance Metadata Call Has Become Very Slow After Moving from IMDSv1 to IMDSv2 on RHEL/SELinux instance" Apr 14, 2021
@github-actions

We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue.

@github-actions bot added the closing-soon label Apr 15, 2022
@vudh1 self-assigned this Apr 15, 2022
vudh1 (Contributor) commented Apr 15, 2022

Hi, is this still persisting with the newest version of the SDK?

@vudh1 added the response-requested label and removed the needs-triage label Apr 15, 2022
@github-actions bot removed the closing-soon and response-requested labels Apr 16, 2022
vudh1 (Contributor) commented Apr 22, 2022

Closing this due to no response. Please reopen if this problem is still persisting.

@vudh1 closed this as completed Apr 22, 2022
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

RichardFoo commented Oct 5, 2022

An interesting (and possibly relevant) article on how IMDSv2 changed the token API call to return a reply with TTL=1 in the IP header. This causes problems when an EC2 instance has an internal router (e.g., containers using NAT; maybe also SELinux), because the TTL=1 packet gets dropped. Timeouts ensue before falling back to IMDSv1, and this causes a much slower response time (e.g., >2 sec instead of <3 ms).
https://marcinchmiel.com/articles/2020-11/a-super-quick-way-to-speed-up-your-containers-on-aws/
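One rough way to check whether an instance is hitting this (assuming tcpdump is available; output will vary) is to time the token PUT on its own while watching the IP TTL of the metadata service's replies in a second shell:

# Shell 1: watch traffic to/from the metadata service; -v prints each packet's IP TTL
sudo tcpdump -i any -n -v host 169.254.169.254

# Shell 2: time just the IMDSv2 token call; multi-second times suggest dropped replies and retries
time curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" > /dev/null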
