Fix AWS RDS hook's DB instance state check #34773

Merged (3 commits) on Nov 6, 2023

AetherUnbound
Contributor

Problem

We observed that when the RdsDbSensor is run against a database identifier which doesn't yet exist, the sensor fails and enters a retry sequence rather than emitting False and poking again at the next interval: WordPress/openverse#2961

The docs for get_db_instance_state say that it should raise AirflowNotFoundException if the DB instance doesn't exist, and the RdsDbSensor's poke method appears to rely on this, since it is written to handle an AirflowNotFoundException.
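The intended sensor behavior can be sketched roughly as follows. This is a minimal illustration, not the Airflow source: `NotFoundError` stands in for AirflowNotFoundException, and `poke`, `get_state`, and `target_statuses` are hypothetical names chosen for the example.

```python
from typing import Callable, Iterable


class NotFoundError(Exception):
    """Hypothetical stand-in for AirflowNotFoundException."""


def poke(get_state: Callable[[], str], target_statuses: Iterable[str]) -> bool:
    """Return True once the resource reaches one of the target statuses.

    A missing resource is treated as "not ready yet" (False), so the
    sensor pokes again at the next interval instead of failing.
    """
    try:
        state = get_state()
    except NotFoundError:
        return False
    return state in set(target_statuses)


def missing_instance() -> str:
    # Simulates a hook call for an identifier that does not exist yet.
    raise NotFoundError("DB instance not found")


print(poke(missing_instance, ["available"]))     # False
print(poke(lambda: "available", ["available"]))  # True
```

If the hook lets a different exception type escape, the `except` clause never matches and the sensor fails instead of returning False, which matches the behavior reported above.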

However, when running the hook code locally, the hook instead raises a DBInstanceNotFoundFault exception:

In [5]: response = hook.conn.describe_db_instances(DBInstanceIdentifier='dev-openverse-fake')
---------------------------------------------------------------------------
DBInstanceNotFoundFault                   Traceback (most recent call last)
Cell In[5], line 1
----> 1 response = hook.conn.describe_db_instances(DBInstanceIdentifier='dev-openverse-fake')

File ~/.local/lib/python3.10/site-packages/botocore/client.py:535, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    531     raise TypeError(
    532         f"{py_operation_name}() only accepts keyword arguments."
    533     )
    534 # The "self" in this scope is referring to the BaseClient.
--> 535 return self._make_api_call(operation_name, kwargs)

File ~/.local/lib/python3.10/site-packages/botocore/client.py:980, in BaseClient._make_api_call(self, operation_name, api_params)
    978     error_code = parsed_response.get("Error", {}).get("Code")
    979     error_class = self.exceptions.from_code(error_code)
--> 980     raise error_class(parsed_response, operation_name)
    981 else:
    982     return parsed_response

DBInstanceNotFoundFault: An error occurred (DBInstanceNotFound) when calling the DescribeDBInstances operation: DBInstance dev-openverse-fake not found.

I tried extracting the error code that is checked here myself:

    def get_db_instance_state(self, db_instance_id: str) -> str:
        """
        Get the current state of a DB instance.

        .. seealso::
            - :external+boto3:py:meth:`RDS.Client.describe_db_instances`

        :param db_instance_id: The ID of the target DB instance.
        :return: Returns the status of the DB instance as a string (eg. "available")
        :raises AirflowNotFoundException: If the DB instance does not exist.
        """
        try:
            response = self.conn.describe_db_instances(DBInstanceIdentifier=db_instance_id)
        except self.conn.exceptions.ClientError as e:
            if e.response["Error"]["Code"] == "DBInstanceNotFoundFault":
                raise AirflowNotFoundException(e)
            raise e
        return response["DBInstances"][0]["DBInstanceStatus"].lower()

In [7]: try:
   ...:     response = hook.conn.describe_db_instances(DBInstanceIdentifier='dev-openverse-fake')
   ...: except hook.conn.exceptions.ClientError as e:
   ...:     x = e
   ...: 

In [8]: x
Out[8]: botocore.errorfactory.DBInstanceNotFoundFault('An error occurred (DBInstanceNotFound) when calling the DescribeDBInstances operation: DBInstance dev-openverse-fake not found.')

In [10]: x.response
Out[10]: 
{'Error': {'Type': 'Sender',
  'Code': 'DBInstanceNotFound',
  'Message': 'DBInstance dev-openverse-fake not found.'},
 'ResponseMetadata': {'RequestId': '[redacted]',
  'HTTPStatusCode': 404,
  'HTTPHeaders': {'x-amzn-requestid': '[redacted]',
   'strict-transport-security': 'max-age=31536000',
   'content-type': 'text/xml',
   'content-length': '289',
   'date': 'Thu, 05 Oct 2023 03:46:28 GMT'},
  'RetryAttempts': 0}}

It looks like the code that should be checked against is actually DBInstanceNotFound, even though the exception class is DBInstanceNotFoundFault. I've made that change here. There are a few other places in this hook where DB[issue]Fault is used and where perhaps DB[issue] should be used instead, but I wanted to get this PR up with a minimal change first to get folks' thoughts 🙂
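To make the mismatch concrete, here is a small sketch that reads the code out of the error payload shown above. The `error_code` helper is a hypothetical name used only for illustration, and the payload is copied from the `x.response` output with the metadata left out.

```python
# Error payload copied from the `x.response` output above (metadata omitted).
response = {
    "Error": {
        "Type": "Sender",
        "Code": "DBInstanceNotFound",
        "Message": "DBInstance dev-openverse-fake not found.",
    }
}


def error_code(error_response: dict) -> str:
    """Return the service error code, or an empty string if it is absent."""
    return error_response.get("Error", {}).get("Code", "")


code = error_code(response)
print(code == "DBInstanceNotFoundFault")  # False: that is the exception class name
print(code == "DBInstanceNotFound")       # True: that is the code in the payload
```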



@@ -240,7 +240,7 @@ def get_db_instance_state(self, db_instance_id: str) -> str:
         try:
             response = self.conn.describe_db_instances(DBInstanceIdentifier=db_instance_id)
         except self.conn.exceptions.ClientError as e:
-            if e.response["Error"]["Code"] == "DBInstanceNotFoundFault":
+            if e.response["Error"]["Code"] == "DBInstanceNotFound":
Contributor

@AetherUnbound Thanks for creating the PR :)

Are there any version changes for SDK/Client/API used now vs earlier?

Contributor

@Taragolis Taragolis Oct 5, 2023

Better to catch RDS.Client.exceptions.DBInstanceNotFoundFault (as self.conn.exceptions.DBInstanceNotFoundFault) rather than catching the generic ClientError and parsing it.

Contributor

@vincbeck vincbeck Oct 5, 2023

But service exceptions are wrapped into ClientError. See documentation

Contributor

But this is weird, because according to the documentation it is DBInstanceNotFoundFault

Contributor Author

@vincbeck Right, I mention this in the PR description - the error code itself within the ClientError is DBInstanceNotFound, but the exception is DBInstanceNotFoundFault. I'm all for catching a more specific exception, but the other functions in this file will probably need to be updated as well and I wanted to confirm that was the right way forward before making that change 🙂

Are there any version changes for SDK/Client/API used now vs earlier?

Not that I'm aware of, I think this has been an issue since we started using this DAG! Let me take a look at our logs and get back to you though.

Contributor

But boto3 documentation is different from AWS API reference documentation:

  • Boto3 documentation: DBInstanceNotFoundFault
  • API reference documentation: DBInstanceNotFound

I trust your testing then, and DBInstanceNotFound should be the right one 👍

Contributor

Response Error Code != Exception name

Here is a simple snippet to play with in a debugger:

import boto3
from botocore.exceptions import ClientError

session = boto3.session.Session(...) # do not forget add creds/profile or set appropriate ENV Vars
client = session.client("rds")


try:
    client.describe_db_instances(DBInstanceIdentifier="foo-bar-spam-egg")
except client.exceptions.DBInstanceNotFoundFault as ex:
    assert isinstance(ex, ClientError)
    assert isinstance(ex, client.exceptions.ClientError)
    raise

@AetherUnbound
Contributor Author

If y'all are up for it, I can also change the other *Fault string checks to remove the Fault, since it's likely those are encountering the same issue. Or we can merge this as-is, either way!

@vincbeck
Contributor

vincbeck commented Oct 5, 2023

If y'all are up for it, I can also change the other *Fault string checks to remove the Fault, since it's likely those are encountering the same issue. Or we can merge this as-is, either way!

It would be great if you could do it as part of this PR. I'd double-check that this is the correct error code, though.

@@ -240,7 +240,7 @@ def get_db_instance_state(self, db_instance_id: str) -> str:
         try:
             response = self.conn.describe_db_instances(DBInstanceIdentifier=db_instance_id)
         except self.conn.exceptions.ClientError as e:
-            if e.response["Error"]["Code"] == "DBInstanceNotFoundFault":
+            if e.response["Error"]["Code"] == "DBInstanceNotFound":
Member

I wonder if we need to check both values, since no one has reported this issue before.

Suggested change
-            if e.response["Error"]["Code"] == "DBInstanceNotFound":
+            if e.response["Error"]["Code"] in ["DBInstanceNotFoundFault", "DBInstanceNotFound"]:

However, if @Taragolis's solution is applicable, it would be much better.
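A rough sketch of the "check both values" idea, written so it runs without AWS access. `FakeClientError` and `NotFoundError` are hypothetical stand-ins for botocore's ClientError and AirflowNotFoundException, and `get_state` and `describe` are illustrative names rather than the hook's real signature.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in for botocore's ClientError (carries `.response`)."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


class NotFoundError(Exception):
    """Hypothetical stand-in for AirflowNotFoundException."""


# Accept both spellings, as in the suggested change above.
NOT_FOUND_CODES = frozenset({"DBInstanceNotFoundFault", "DBInstanceNotFound"})


def get_state(describe, db_instance_id: str) -> str:
    """Translate a "not found" client error into NotFoundError; re-raise the rest."""
    try:
        response = describe(DBInstanceIdentifier=db_instance_id)
    except FakeClientError as e:
        if e.response["Error"]["Code"] in NOT_FOUND_CODES:
            raise NotFoundError(str(e)) from e
        raise
    return response["DBInstances"][0]["DBInstanceStatus"].lower()


def describe_missing(**kwargs):
    # Simulates the service response for an unknown identifier.
    raise FakeClientError("DBInstanceNotFound", "DBInstance not found.")


try:
    get_state(describe_missing, "dev-openverse-fake")
except NotFoundError as err:
    print("not found:", err)
```

Catching the specific exception class, as suggested earlier in the thread, would avoid string comparison entirely, but it depends on the client actually raising that class.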

@AetherUnbound
Contributor Author

I haven't forgotten about this, just got busy! It sounds like folks are comfortable with catching specific exceptions instead of checking error codes, which should be more robust in general. I'll modify this PR for all of the RDS functions to make it consistent in that regard.

@AetherUnbound
Contributor Author

Interesting note, all of the exception types end in *Fault, but I believe the error codes themselves do not:

In [8]: dir(hook.conn.exceptions)
Out[8]: 
['AuthorizationAlreadyExistsFault',
 'AuthorizationNotFoundFault',
 'AuthorizationQuotaExceededFault',
 'BackupPolicyNotFoundFault',
 'BlueGreenDeploymentAlreadyExistsFault',
 'BlueGreenDeploymentNotFoundFault',
 'CertificateNotFoundFault',
 'ClientError',
 'CreateCustomDBEngineVersionFault',
 'CustomAvailabilityZoneNotFoundFault',
 'CustomDBEngineVersionAlreadyExistsFault',
 'CustomDBEngineVersionNotFoundFault',
 'CustomDBEngineVersionQuotaExceededFault',
 'DBClusterAlreadyExistsFault',
 'DBClusterAutomatedBackupNotFoundFault',
 'DBClusterAutomatedBackupQuotaExceededFault',
 'DBClusterBacktrackNotFoundFault',
 'DBClusterEndpointAlreadyExistsFault',
 'DBClusterEndpointNotFoundFault',
 'DBClusterEndpointQuotaExceededFault',
 'DBClusterNotFoundFault',
 'DBClusterParameterGroupNotFoundFault',
 'DBClusterQuotaExceededFault',
 'DBClusterRoleAlreadyExistsFault',
 'DBClusterRoleNotFoundFault',
 'DBClusterRoleQuotaExceededFault',
 'DBClusterSnapshotAlreadyExistsFault',
 'DBClusterSnapshotNotFoundFault',
 'DBInstanceAlreadyExistsFault',
 'DBInstanceAutomatedBackupNotFoundFault',
 'DBInstanceAutomatedBackupQuotaExceededFault',
 'DBInstanceNotFoundFault',
 'DBInstanceRoleAlreadyExistsFault',
 'DBInstanceRoleNotFoundFault',
 'DBInstanceRoleQuotaExceededFault',
 'DBLogFileNotFoundFault',
 'DBParameterGroupAlreadyExistsFault',
 'DBParameterGroupNotFoundFault',
 'DBParameterGroupQuotaExceededFault',
 'DBProxyAlreadyExistsFault',
 'DBProxyEndpointAlreadyExistsFault',
 'DBProxyEndpointNotFoundFault',
 'DBProxyEndpointQuotaExceededFault',
 'DBProxyNotFoundFault',
 'DBProxyQuotaExceededFault',
 'DBProxyTargetAlreadyRegisteredFault',
 'DBProxyTargetGroupNotFoundFault',
 'DBProxyTargetNotFoundFault',
 'DBSecurityGroupAlreadyExistsFault',
 'DBSecurityGroupNotFoundFault',
 'DBSecurityGroupNotSupportedFault',
 'DBSecurityGroupQuotaExceededFault',
 'DBSnapshotAlreadyExistsFault',
 'DBSnapshotNotFoundFault',
 'DBSubnetGroupAlreadyExistsFault',
 'DBSubnetGroupDoesNotCoverEnoughAZs',
 'DBSubnetGroupNotAllowedFault',
 'DBSubnetGroupNotFoundFault',
 'DBSubnetGroupQuotaExceededFault',
 'DBSubnetQuotaExceededFault',
 'DBUpgradeDependencyFailureFault',
 'DomainNotFoundFault',
 'Ec2ImagePropertiesNotSupportedFault',
 'EventSubscriptionQuotaExceededFault',
 'ExportTaskAlreadyExistsFault',
 'ExportTaskNotFoundFault',
 'GlobalClusterAlreadyExistsFault',
 'GlobalClusterNotFoundFault',
 'GlobalClusterQuotaExceededFault',
 'IamRoleMissingPermissionsFault',
 'IamRoleNotFoundFault',
 'InstanceQuotaExceededFault',
 'InsufficientAvailableIPsInSubnetFault',
 'InsufficientDBClusterCapacityFault',
 'InsufficientDBInstanceCapacityFault',
 'InsufficientStorageClusterCapacityFault',
 'InvalidBlueGreenDeploymentStateFault',
 'InvalidCustomDBEngineVersionStateFault',
 'InvalidDBClusterAutomatedBackupStateFault',
 'InvalidDBClusterCapacityFault',
 'InvalidDBClusterEndpointStateFault',
 'InvalidDBClusterSnapshotStateFault',
 'InvalidDBClusterStateFault',
 'InvalidDBInstanceAutomatedBackupStateFault',
 'InvalidDBInstanceStateFault',
 'InvalidDBParameterGroupStateFault',
 'InvalidDBProxyEndpointStateFault',
 'InvalidDBProxyStateFault',
 'InvalidDBSecurityGroupStateFault',
 'InvalidDBSnapshotStateFault',
 'InvalidDBSubnetGroupFault',
 'InvalidDBSubnetGroupStateFault',
 'InvalidDBSubnetStateFault',
 'InvalidEventSubscriptionStateFault',
 'InvalidExportOnlyFault',
 'InvalidExportSourceStateFault',
 'InvalidExportTaskStateFault',
 'InvalidGlobalClusterStateFault',
 'InvalidOptionGroupStateFault',
 'InvalidRestoreFault',
 'InvalidS3BucketFault',
 'InvalidSubnet',
 'InvalidVPCNetworkStateFault',
 'KMSKeyNotAccessibleFault',
 'NetworkTypeNotSupported',
 'OptionGroupAlreadyExistsFault',
 'OptionGroupNotFoundFault',
 'OptionGroupQuotaExceededFault',
 'PointInTimeRestoreNotEnabledFault',
 'ProvisionedIopsNotAvailableInAZFault',
 'ReservedDBInstanceAlreadyExistsFault',
 'ReservedDBInstanceNotFoundFault',
 'ReservedDBInstanceQuotaExceededFault',
 'ReservedDBInstancesOfferingNotFoundFault',
 'ResourceNotFoundFault',
 'SNSInvalidTopicFault',
 'SNSNoAuthorizationFault',
 'SNSTopicArnNotFoundFault',
 'SharedSnapshotQuotaExceededFault',
 'SnapshotQuotaExceededFault',
 'SourceClusterNotSupportedFault',
 'SourceDatabaseNotSupportedFault',
 'SourceNotFoundFault',
 'StorageQuotaExceededFault',
 'StorageTypeNotAvailableFault',
 'StorageTypeNotSupportedFault',
 'SubnetAlreadyInUse',
 'SubscriptionAlreadyExistFault',
 'SubscriptionCategoryNotFoundFault',
 'SubscriptionNotFoundFault',

Going to go ahead and change all the logic over!
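As a rough illustration of the naming pattern noted above, the helper below guesses an error code by dropping a trailing "Fault" from the exception name. This is only a heuristic based on that observation: `guess_error_code` is a hypothetical name, and the result has not been checked against every RDS error.

```python
def guess_error_code(exception_name: str) -> str:
    """Guess the error code by dropping a trailing "Fault", if present.

    Heuristic only: it mirrors the observation above and is not a
    verified mapping for every exception in the list.
    """
    suffix = "Fault"
    if exception_name.endswith(suffix):
        return exception_name[: -len(suffix)]
    return exception_name


for name in ("DBInstanceNotFoundFault", "DBSnapshotNotFoundFault", "InvalidSubnet"):
    print(name, "->", guess_error_code(name))
```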

@AetherUnbound
Contributor Author

Crap, GitHub outage is affecting the build steps 😅 Anyone mind re-running them when you have a moment?

@vincbeck
Contributor

vincbeck commented Nov 3, 2023

I just rebased it. It will re-execute the tests. Though, I am not sure the outage is over

@AetherUnbound
Contributor Author

Weird, it looks like a few tests are failing because they're raising a ClientError instead of the actual exception O_o

FAILED tests/providers/amazon/aws/sensors/test_rds.py::TestRdsExportTaskExistenceSensor::test_export_task_poke_false - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier my-db-instance-snap-export is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_export_task_state_not_found - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier does_not_exist is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_event_subscription_state_not_found - botocore.exceptions.ClientError: An error occurred (SubscriptionNotFoundFault) when calling the DescribeEventSubscriptions operation: Subscription does_not_exist not found.

@vincbeck
Contributor

vincbeck commented Nov 3, 2023

Weird, it looks like a few tests are failing because they're raising a ClientError instead of the actual exception O_o

FAILED tests/providers/amazon/aws/sensors/test_rds.py::TestRdsExportTaskExistenceSensor::test_export_task_poke_false - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier my-db-instance-snap-export is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_export_task_state_not_found - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier does_not_exist is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_event_subscription_state_not_found - botocore.exceptions.ClientError: An error occurred (SubscriptionNotFoundFault) when calling the DescribeEventSubscriptions operation: Subscription does_not_exist not found.

Yes, this is expected from the documentation:

But service exceptions are wrapped into ClientError. See documentation

@AetherUnbound
Contributor Author

That's so odd, especially since local testing (shown in the issue description) raises a specific error in some cases 😅 Ah well, I'll change those pieces back.
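One possible explanation, offered as an assumption rather than something confirmed in this thread: if the specific exception class is selected by error code, a response whose code is not registered falls back to the generic ClientError. The failing test messages above carry codes ending in Fault (for example ExportTaskNotFoundFault), while the live AWS response in the PR description carried DBInstanceNotFound. The sketch below is a simplified stand-in with hypothetical names (`CODE_TO_EXCEPTION`, `exception_for`), not botocore's implementation.

```python
class ClientError(Exception):
    """Stand-in generic error (hypothetical, not botocore's class)."""


class ExportTaskNotFoundFault(ClientError):
    """Stand-in for a service-specific exception class."""


# Assumed registry: specific classes are looked up by error code.
CODE_TO_EXCEPTION = {"ExportTaskNotFound": ExportTaskNotFoundFault}


def exception_for(code: str) -> type:
    """Return the specific class for a known code, else the generic base."""
    return CODE_TO_EXCEPTION.get(code, ClientError)


print(exception_for("ExportTaskNotFound").__name__)       # ExportTaskNotFoundFault
print(exception_for("ExportTaskNotFoundFault").__name__)  # ClientError
```

Under that assumption, an `except ExportTaskNotFoundFault` clause would only match the first case, which would be consistent with the test failures quoted above.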

@AetherUnbound
Contributor Author

Hmm, that docs build failure seems unrelated 🤔

@potiuk potiuk force-pushed the bugfix/aws-rds-db-not-found branch from e4ae104 to 5b0daec Compare November 3, 2023 23:46
@potiuk
Member

potiuk commented Nov 3, 2023

Hmm, that docs build failure seems unrelated 🤔

Yep. Rebased after:

a) we fixed it in main
b) proposed a PR #35424 that will prevent similar errors from getting merged to main in the future

@vincbeck vincbeck merged commit f24e519 into apache:main Nov 6, 2023
45 checks passed
@vincbeck
Contributor

vincbeck commented Nov 6, 2023

Thanks for the work @AetherUnbound 🥳

@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Nov 20, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.0 milestone Nov 20, 2023
Labels
area:providers changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) provider:amazon AWS/Amazon - related issues

7 participants