
[Integration][AWS] - Fix ExpiredTokenException #1041

Merged: 21 commits merged into main from port-10319/improvement-aws on Sep 26, 2024

@mk-armah (Member) commented Sep 20, 2024

Description

What

  • Improved the mechanism for parallel fetching of AWS account resources.
  • Fixed ExpiredTokenException by replacing the event-based caching system with a time-dependent caching mechanism. The new approach ensures that the role is reassumed and session credentials are refreshed when 80% of the session duration has been used.
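The description does not say how the parallel fetching was changed. Purely as an illustration of bounded concurrency across accounts, here is a minimal sketch using `asyncio.Semaphore` together with `asyncio.gather`. `fetch_all_accounts`, the `fetch_account` callback and the limit of 5 are hypothetical and are not the integration's code.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical cap on how many accounts are fetched at the same time.
MAX_CONCURRENT_ACCOUNTS = 5


async def fetch_all_accounts(
    account_ids: List[str],
    fetch_account: Callable[[str], Awaitable[List[dict]]],
    limit: int = MAX_CONCURRENT_ACCOUNTS,
) -> Dict[str, List[dict]]:
    """Fetch resources for every account concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(account_id: str) -> List[dict]:
        # At most `limit` coroutines hold the semaphore simultaneously.
        async with semaphore:
            return await fetch_account(account_id)

    # gather returns results in input order, so they line up with account_ids.
    results = await asyncio.gather(*(fetch_one(a) for a in account_ids))
    return dict(zip(account_ids, results))
```

Capping concurrency keeps the number of simultaneous AWS API calls bounded, which helps with rate limiting, while `gather` still overlaps the per-account work.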

Why

  • The previous event-based caching system only refreshed credentials on resync events, so session credentials could expire in the middle of a resync and raise ExpiredTokenException.
  • Implementing a time-dependent caching mechanism ensures that session credentials are refreshed proactively, preventing disruptions.

How

  • Replaced the resync-dependent caching system with a time-based cache that monitors the session expiry.
  • Added logic to reassume the role and refresh credentials once 80% of the session duration has passed, improving session reliability.
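To make the 80% rule concrete: with a session duration of 3600 seconds (an assumed value), credentials are refreshed after 2880 seconds, leaving a 720-second margin before STS expires them. Below is a minimal synchronous sketch, assuming an injected clock; `TimedCredentialCache` is a hypothetical class standing in for the real aiocache-based code.

```python
import time
from typing import Callable, Dict, Optional

# Assumed values for illustration; the PR only states the 80% threshold.
SESSION_DURATION_SECONDS = 3600
REFRESH_FRACTION = 0.8
CACHE_DURATION_SECONDS = int(SESSION_DURATION_SECONDS * REFRESH_FRACTION)  # 2880


class TimedCredentialCache:
    """Re-assume the role once REFRESH_FRACTION of the session has elapsed (hypothetical)."""

    def __init__(
        self,
        assume_role: Callable[[], Dict[str, str]],
        ttl_seconds: float = CACHE_DURATION_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._assume_role = assume_role
        self._ttl = ttl_seconds
        self._clock = clock
        self._credentials: Optional[Dict[str, str]] = None
        self._fetched_at = 0.0

    def get(self) -> Dict[str, str]:
        now = self._clock()
        expired = now - self._fetched_at >= self._ttl
        if self._credentials is None or expired:
            # Refresh before STS actually expires the session.
            self._credentials = self._assume_role()
            self._fetched_at = now
        return self._credentials
```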

Type of change


  • Bug fix (non-breaking change which fixes an issue)

All tests should be run against the Port production environment (using a testing org).

Core testing checklist

  • Integration able to create all default resources from scratch
  • Resync finishes successfully
  • Resync able to create entities
  • Resync able to update entities
  • Resync able to detect and delete entities
  • Scheduled resync able to abort existing resync and start a new one
  • Tested with at least 2 integrations from scratch
  • Tested with Kafka and Polling event listeners
  • Tested deletion of entities that don't pass the selector

Integration testing checklist

  • Integration able to create all default resources from scratch
  • Resync able to create entities
  • Resync able to update entities
  • Resync able to detect and delete entities
  • Resync finishes successfully
  • If a new resource kind is added or updated in the integration, add example raw data, mapping and expected result to the examples folder in the integration directory.
  • If a resource kind is updated, run the integration with the example data and check that the expected result is achieved.
  • If a new resource kind is added or updated, validate that live events for that resource work as expected.
  • Docs PR link here

Preflight checklist

  • Handled rate limiting
  • Handled pagination
  • Implemented the code in async
  • Supported multi-account

Screenshots

Include screenshots from your environment showing how the resources of the integration will look.

API Documentation

Provide links to the API documentation used for this integration.

@mk-armah requested a review from a team as a code owner on September 20, 2024 16:13
@mk-armah changed the title from "Port 10319/improvement aws" to "[Integration][AWS] - Fix ExpiredTokenException" on Sep 20, 2024
@PeyGis (Contributor) left a comment

Good job on using sessionDuration in the assume-role client. I left a comment about how aiocache handles the TTL.


@cached(ttl=CACHE_DURATION_SECONDS, cache=Cache.MEMORY)
async def update_available_access_credentials() -> None:
@PeyGis (Contributor):

I would like to understand what happens when the cache is cleared after the duration. Also, how is the simple memory cache different from using FastAPI's event.attribute functionality?

I guess my question is: since you changed the approach from event-based to time-based, what happens to the assumed-role credentials after the TTL?

Also, after the cache is cleared, how does it handle re-entry? That is, after the first TTL duration has elapsed, we signal that the credentials need to be refreshed. What happens after the second hour? How does the TTL behave then?

@mk-armah (Member, Author) replied:

The cache is meant to expire once 80% of the session duration has been used. When the cached entry is cleared, the next call to update_available_access_credentials resets the session and produces new credentials, which extends the expiry time once again. This can happen at any point within a resync.

The TTL lasts only for the specified time. update_available_access_credentials is called as often as possible, so the first call after the TTL runs out (when the session is close to expiry) triggers a refresh and starts a new TTL window.
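The behaviour described in this reply, including what happens after the second window, can be illustrated with a stdlib-only stand-in for `@cached(ttl=...)`. `ttl_cached` is a hypothetical decorator, not aiocache's implementation; it only assumes the semantics described above: a cached value is reused until the TTL elapses, and the next call after that re-runs the function and stores a fresh entry.

```python
import time
from functools import wraps
from typing import Any, Awaitable, Callable, Dict, Tuple


def ttl_cached(ttl: float, clock: Callable[[], float] = time.monotonic):
    """Memoize an async function's result for `ttl` seconds (illustrative stand-in)."""

    def decorator(func: Callable[..., Awaitable[Any]]):
        entries: Dict[Tuple[Any, ...], Tuple[float, Any]] = {}

        @wraps(func)
        async def wrapper(*args: Any) -> Any:
            now = clock()
            hit = entries.get(args)
            if hit is not None and now - hit[0] < ttl:
                # Within the TTL window: reuse the stored result.
                return hit[1]
            # First call, or TTL elapsed: run the function again and
            # start a fresh TTL window measured from this moment.
            result = await func(*args)
            entries[args] = (now, result)
            return result

        return wrapper

    return decorator
```

With a TTL of 2880 s, if calls arrive at 0 s, 1000 s, 2880 s, 5000 s and 5760 s, only the calls at 0 s, 2880 s and 5760 s re-run the function, so each window is measured from the most recent refresh. Without a lock, two coroutines that miss the cache at the same moment would both refresh; an `asyncio.Lock` around the miss path (re-checking the entry after acquiring it) avoids that.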

@@ -93,7 +96,9 @@ async def _get_organization_session(self) -> aioboto3.Session | None:
 async with application_session.client("sts") as sts_client:
     try:
         organizations_client = await sts_client.assume_role(
-            RoleArn=organization_role_arn, RoleSessionName="AssumeRoleSession"
+            RoleArn=organization_role_arn,
+            RoleSessionName="AssumeRoleSession",
@matan84 (Member) commented on Sep 22, 2024

Suggested change
- RoleSessionName="AssumeRoleSession",
+ RoleSessionName="OceanOrgAssumeRoleSession",
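PeyGis's review mentions passing the session duration to the assume-role call, and this suggestion renames the session. A minimal sketch of the resulting arguments, assuming the duration is supplied through STS's `DurationSeconds` parameter (the diff excerpt shown here does not include that line). `build_assume_role_kwargs` and its 3600-second default are hypothetical.

```python
from typing import Any, Dict

# STS AssumeRole accepts DurationSeconds from 900 s (15 min) to 43200 s (12 h);
# the role's MaxSessionDuration can lower the upper bound, and role chaining
# limits sessions to 1 hour.
MIN_DURATION_SECONDS = 900
MAX_DURATION_SECONDS = 43200


def build_assume_role_kwargs(
    role_arn: str,
    duration_seconds: int = 3600,
    session_name: str = "OceanOrgAssumeRoleSession",
) -> Dict[str, Any]:
    """Return keyword arguments for sts_client.assume_role (hypothetical helper)."""
    if not MIN_DURATION_SECONDS <= duration_seconds <= MAX_DURATION_SECONDS:
        raise ValueError(
            f"DurationSeconds must be between {MIN_DURATION_SECONDS} and "
            f"{MAX_DURATION_SECONDS}, got {duration_seconds}"
        )
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": duration_seconds,
    }


# Inside an aioboto3 STS client context, the call could look like:
#   response = await sts_client.assume_role(**build_assume_role_kwargs(organization_role_arn))
#   expiration = response["Credentials"]["Expiration"]
```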

@matan84 (Member) left a comment

Added some comments. Please make sure to add tests.

integrations/aws/aws/session_manager.py (outdated, resolved)
integrations/aws/main.py (outdated, resolved)
integrations/aws/main.py (outdated, resolved)
@github-actions (bot) added the size/XL label and removed the size/M label on Sep 24, 2024
@matan84 (Member) left a comment

Added some more comments

integrations/aws/main.py (resolved)
integrations/aws/main.py (outdated, resolved)
integrations/aws/main.py (outdated, resolved)
integrations/aws/main.py (resolved)
@matan84 (Member) left a comment

LGTM

@PeyGis (Contributor) left a comment

looks nice

@matan84 merged commit 880b773 into main on Sep 26, 2024
15 checks passed
@matan84 deleted the port-10319/improvement-aws branch on September 26, 2024 13:57
4 participants