Fix AWS RDS hook's DB instance state check #34773

AetherUnbound · 2023-10-05T04:08:52Z

Problem

We observed that when the RdsDbSensor is run against a database identifier which doesn't yet exist, the sensor fails and enters a retry sequence rather than emitting False and poking again at the next interval: WordPress/openverse#2961

The docs for get_db_instance_state say that it should raise AirflowNotFoundException if the DB instance doesn't exist, and the RdsDbSensor's poke method appears to rely on this, since it is written to handle an AirflowNotFoundException.
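To make the expected behavior concrete, here is a minimal sketch of the pattern described above, not the actual sensor code. `NotFoundError` is a hypothetical stand-in for AirflowNotFoundException, and `poke` and `get_state` are illustrative names chosen for this example.

```python
from typing import Callable, Collection


class NotFoundError(Exception):
    """Hypothetical stand-in for AirflowNotFoundException."""


def poke(get_state: Callable[[], str], target_statuses: Collection[str]) -> bool:
    """Return True once the resource reaches a target status.

    A missing resource is treated as "not ready yet" (False) so the caller
    can poke again at the next interval instead of failing.
    """
    try:
        state = get_state()
    except NotFoundError:
        return False
    return state in target_statuses
```

With this shape, any exception other than the not-found one still propagates, which matches the failure we saw: the hook let a different exception through, so the sensor failed instead of returning False.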

However, when running the hook code locally, the hook instead raises a DBInstanceNotFoundFault exception:

In [5]: response = hook.conn.describe_db_instances(DBInstanceIdentifier='dev-openverse-fake')
---------------------------------------------------------------------------
DBInstanceNotFoundFault                   Traceback (most recent call last)
Cell In[5], line 1
----> 1 response = hook.conn.describe_db_instances(DBInstanceIdentifier='dev-openverse-fake')

File ~/.local/lib/python3.10/site-packages/botocore/client.py:535, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    531     raise TypeError(
    532         f"{py_operation_name}() only accepts keyword arguments."
    533     )
    534 # The "self" in this scope is referring to the BaseClient.
--> 535 return self._make_api_call(operation_name, kwargs)

File ~/.local/lib/python3.10/site-packages/botocore/client.py:980, in BaseClient._make_api_call(self, operation_name, api_params)
    978     error_code = parsed_response.get("Error", {}).get("Code")
    979     error_class = self.exceptions.from_code(error_code)
--> 980     raise error_class(parsed_response, operation_name)
    981 else:
    982     return parsed_response

DBInstanceNotFoundFault: An error occurred (DBInstanceNotFound) when calling the DescribeDBInstances operation: DBInstance dev-openverse-fake not found.

I tried extracting the error code that is checked here myself:

airflow/airflow/providers/amazon/aws/hooks/rds.py

Lines 229 to 246 in 0c8e30e

    
    def get_db_instance_state(self, db_instance_id: str) -> str:
        """
        Get the current state of a DB instance.

        .. seealso::
            - :external+boto3:py:meth:`RDS.Client.describe_db_instances`

        :param db_instance_id: The ID of the target DB instance.
        :return: Returns the status of the DB instance as a string (eg. "available")
        :raises AirflowNotFoundException: If the DB instance does not exist.
        """
        try:
            response = self.conn.describe_db_instances(DBInstanceIdentifier=db_instance_id)
        except self.conn.exceptions.ClientError as e:
            if e.response["Error"]["Code"] == "DBInstanceNotFoundFault":
                raise AirflowNotFoundException(e)
            raise e
        return response["DBInstances"][0]["DBInstanceStatus"].lower()

In [7]: try:
   ...:     response = hook.conn.describe_db_instances(DBInstanceIdentifier='dev-openverse-fake')
   ...: except hook.conn.exceptions.ClientError as e:
   ...:     x = e
   ...: 

In [8]: x
Out[8]: botocore.errorfactory.DBInstanceNotFoundFault('An error occurred (DBInstanceNotFound) when calling the DescribeDBInstances operation: DBInstance dev-openverse-fake not found.')

In [10]: x.response
Out[10]: 
{'Error': {'Type': 'Sender',
  'Code': 'DBInstanceNotFound',
  'Message': 'DBInstance dev-openverse-fake not found.'},
 'ResponseMetadata': {'RequestId': '[redacted]',
  'HTTPStatusCode': 404,
  'HTTPHeaders': {'x-amzn-requestid': '[redacted]',
   'strict-transport-security': 'max-age=31536000',
   'content-type': 'text/xml',
   'content-length': '289',
   'date': 'Thu, 05 Oct 2023 03:46:28 GMT'},
  'RetryAttempts': 0}}

It looks like the code that should be checked against is actually DBInstanceNotFound, even though the exception is DBInstanceNotFoundFault. I've made that change here. There are a few other places in this hook where DB[issue]Fault is used where perhaps DB[issue] should be used instead, but I wanted to get this PR up with a minimal change at least to get folks' thoughts 🙂
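As a small illustration of the mismatch, the following sketch reuses the error payload from the x.response output above (response metadata omitted) and compares it against both strings. The variable names are only for this example.

```python
# Error payload copied from the x.response output above (metadata omitted).
response = {
    "Error": {
        "Type": "Sender",
        "Code": "DBInstanceNotFound",
        "Message": "DBInstance dev-openverse-fake not found.",
    }
}

code = response["Error"]["Code"]

# The exception class name carries a "Fault" suffix; the error code does not.
matches_class_name = code == "DBInstanceNotFoundFault"
matches_error_code = code == "DBInstanceNotFound"

print(matches_class_name, matches_error_code)  # False True
```

Because the first comparison is False, the existing check never raises AirflowNotFoundException for this response and the original error is re-raised instead.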


utkarsharma2 · 2023-10-05T06:58:10Z

airflow/providers/amazon/aws/hooks/rds.py

@@ -240,7 +240,7 @@ def get_db_instance_state(self, db_instance_id: str) -> str:
        try:
            response = self.conn.describe_db_instances(DBInstanceIdentifier=db_instance_id)
        except self.conn.exceptions.ClientError as e:
-            if e.response["Error"]["Code"] == "DBInstanceNotFoundFault":
+            if e.response["Error"]["Code"] == "DBInstanceNotFound":


@AetherUnbound Thanks for creating the PR :)

Are there any version changes for SDK/Client/API used now vs earlier?

Better to catch RDS.Client.exceptions.DBInstanceNotFoundFault (as self.conn.exceptions.DBInstanceNotFoundFault) rather than generic ClientError and parse it.

But service exceptions are wrapped into ClientError. See documentation

But this is weird, because according to the documentation it is DBInstanceNotFoundFault

@vincbeck Right, I mention this in the PR description - the error code itself within the ClientError is DBInstanceNotFound, but the exception is DBInstanceNotFoundFault. I'm all for catching a more specific exception, but the other functions in this file will probably need to be updated as well and I wanted to confirm that was the right way forward before making that change 🙂

Are there any version changes for SDK/Client/API used now vs earlier?

Not that I'm aware of, I think this has been an issue since we started using this DAG! Let me take a look at our logs and get back to you though.

But boto3 documentation is different from AWS API reference documentation:

Boto3 documentation: DBInstanceNotFoundFault

API reference documentation: DBInstanceNotFound

I trust your testing then, and DBInstanceNotFound should be the right one 👍

Response Error Code != Exception name

A simple snippet to play with in a debugger:

    import boto3
    from botocore.exceptions import ClientError

    # Do not forget to add creds/profile or set the appropriate env vars.
    session = boto3.session.Session(...)
    client = session.client("rds")
    try:
        client.describe_db_instances(DBInstanceIdentifier="foo-bar-spam-egg")
    except client.exceptions.DBInstanceNotFoundFault as ex:
        # The modeled exception is also a ClientError.
        assert isinstance(ex, ClientError)
        assert isinstance(ex, client.exceptions.ClientError)
        raise

AetherUnbound · 2023-10-05T17:36:43Z

If y'all are up for it, I can also change the other *Fault string checks to remove the Fault, since it's likely those are encountering the same issue. Or we can merge this as-is, either way!

vincbeck · 2023-10-05T18:20:09Z

If y'all are up for it, I can also change the other *Fault string checks to remove the Fault, since it's likely those are encountering the same issue. Or we can merge this as-is, either way!

It would be great if you could do it as part of this PR. I'd double-check that this is the correct error code, though.

hussein-awala · 2023-10-05T21:17:32Z

airflow/providers/amazon/aws/hooks/rds.py

@@ -240,7 +240,7 @@ def get_db_instance_state(self, db_instance_id: str) -> str:
        try:
            response = self.conn.describe_db_instances(DBInstanceIdentifier=db_instance_id)
        except self.conn.exceptions.ClientError as e:
-            if e.response["Error"]["Code"] == "DBInstanceNotFoundFault":
+            if e.response["Error"]["Code"] == "DBInstanceNotFound":


I wonder if we need to check both values since no one reported the issues before.

Suggested change

-            if e.response["Error"]["Code"] == "DBInstanceNotFound":
+            if e.response["Error"]["Code"] in ["DBInstanceNotFoundFault", "DBInstanceNotFound"]:

However, if @Taragolis's solution is applicable, it would be much better.
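For reference, the "check both values" idea could be sketched as a small helper like the one below. This is only an illustration of the suggestion, not code from the PR; `NOT_FOUND_CODES` and `is_db_instance_not_found` are hypothetical names.

```python
from typing import Any, Mapping

# Both spellings seen in this thread: the boto3 docs name and the code
# actually returned in the API response.
NOT_FOUND_CODES = frozenset({"DBInstanceNotFoundFault", "DBInstanceNotFound"})


def is_db_instance_not_found(error_response: Mapping[str, Any]) -> bool:
    """Return True if the response's error code is either not-found spelling."""
    code = error_response.get("Error", {}).get("Code")
    return code in NOT_FOUND_CODES
```

Usage would be along the lines of `is_db_instance_not_found(e.response)` inside the `except` block. Using `.get` means a response without an "Error" key simply yields False rather than raising a KeyError.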

AetherUnbound · 2023-10-10T15:19:11Z

I haven't forgotten about this, just got busy! It sounds like folks are comfortable with catching specific exceptions instead of checking error codes, which should be more robust in general. I'll modify this PR for all of the RDS functions to make it consistent in that regard.

AetherUnbound · 2023-11-03T18:49:46Z

Interesting note: nearly all of the exception types end in *Fault (a few in the list below, such as InvalidSubnet and SubnetAlreadyInUse, do not), but I believe the error codes themselves do not:

In [8]: dir(hook.conn.exceptions)
Out[8]: 
['AuthorizationAlreadyExistsFault',
 'AuthorizationNotFoundFault',
 'AuthorizationQuotaExceededFault',
 'BackupPolicyNotFoundFault',
 'BlueGreenDeploymentAlreadyExistsFault',
 'BlueGreenDeploymentNotFoundFault',
 'CertificateNotFoundFault',
 'ClientError',
 'CreateCustomDBEngineVersionFault',
 'CustomAvailabilityZoneNotFoundFault',
 'CustomDBEngineVersionAlreadyExistsFault',
 'CustomDBEngineVersionNotFoundFault',
 'CustomDBEngineVersionQuotaExceededFault',
 'DBClusterAlreadyExistsFault',
 'DBClusterAutomatedBackupNotFoundFault',
 'DBClusterAutomatedBackupQuotaExceededFault',
 'DBClusterBacktrackNotFoundFault',
 'DBClusterEndpointAlreadyExistsFault',
 'DBClusterEndpointNotFoundFault',
 'DBClusterEndpointQuotaExceededFault',
 'DBClusterNotFoundFault',
 'DBClusterParameterGroupNotFoundFault',
 'DBClusterQuotaExceededFault',
 'DBClusterRoleAlreadyExistsFault',
 'DBClusterRoleNotFoundFault',
 'DBClusterRoleQuotaExceededFault',
 'DBClusterSnapshotAlreadyExistsFault',
 'DBClusterSnapshotNotFoundFault',
 'DBInstanceAlreadyExistsFault',
 'DBInstanceAutomatedBackupNotFoundFault',
 'DBInstanceAutomatedBackupQuotaExceededFault',
 'DBInstanceNotFoundFault',
 'DBInstanceRoleAlreadyExistsFault',
 'DBInstanceRoleNotFoundFault',
 'DBInstanceRoleQuotaExceededFault',
 'DBLogFileNotFoundFault',
 'DBParameterGroupAlreadyExistsFault',
 'DBParameterGroupNotFoundFault',
 'DBParameterGroupQuotaExceededFault',
 'DBProxyAlreadyExistsFault',
 'DBProxyEndpointAlreadyExistsFault',
 'DBProxyEndpointNotFoundFault',
 'DBProxyEndpointQuotaExceededFault',
 'DBProxyNotFoundFault',
 'DBProxyQuotaExceededFault',
 'DBProxyTargetAlreadyRegisteredFault',
 'DBProxyTargetGroupNotFoundFault',
 'DBProxyTargetNotFoundFault',
 'DBSecurityGroupAlreadyExistsFault',
 'DBSecurityGroupNotFoundFault',
 'DBSecurityGroupNotSupportedFault',
 'DBSecurityGroupQuotaExceededFault',
 'DBSnapshotAlreadyExistsFault',
 'DBSnapshotNotFoundFault',
 'DBSubnetGroupAlreadyExistsFault',
 'DBSubnetGroupDoesNotCoverEnoughAZs',
 'DBSubnetGroupNotAllowedFault',
 'DBSubnetGroupNotFoundFault',
 'DBSubnetGroupQuotaExceededFault',
 'DBSubnetQuotaExceededFault',
 'DBUpgradeDependencyFailureFault',
 'DomainNotFoundFault',
 'Ec2ImagePropertiesNotSupportedFault',
 'EventSubscriptionQuotaExceededFault',
 'ExportTaskAlreadyExistsFault',
 'ExportTaskNotFoundFault',
 'GlobalClusterAlreadyExistsFault',
 'GlobalClusterNotFoundFault',
 'GlobalClusterQuotaExceededFault',
 'IamRoleMissingPermissionsFault',
 'IamRoleNotFoundFault',
 'InstanceQuotaExceededFault',
 'InsufficientAvailableIPsInSubnetFault',
 'InsufficientDBClusterCapacityFault',
 'InsufficientDBInstanceCapacityFault',
 'InsufficientStorageClusterCapacityFault',
 'InvalidBlueGreenDeploymentStateFault',
 'InvalidCustomDBEngineVersionStateFault',
 'InvalidDBClusterAutomatedBackupStateFault',
 'InvalidDBClusterCapacityFault',
 'InvalidDBClusterEndpointStateFault',
 'InvalidDBClusterSnapshotStateFault',
 'InvalidDBClusterStateFault',
 'InvalidDBInstanceAutomatedBackupStateFault',
 'InvalidDBInstanceStateFault',
 'InvalidDBParameterGroupStateFault',
 'InvalidDBProxyEndpointStateFault',
 'InvalidDBProxyStateFault',
 'InvalidDBSecurityGroupStateFault',
 'InvalidDBSnapshotStateFault',
 'InvalidDBSubnetGroupFault',
 'InvalidDBSubnetGroupStateFault',
 'InvalidDBSubnetStateFault',
 'InvalidEventSubscriptionStateFault',
 'InvalidExportOnlyFault',
 'InvalidExportSourceStateFault',
 'InvalidExportTaskStateFault',
 'InvalidGlobalClusterStateFault',
 'InvalidOptionGroupStateFault',
 'InvalidRestoreFault',
 'InvalidS3BucketFault',
 'InvalidSubnet',
 'InvalidVPCNetworkStateFault',
 'KMSKeyNotAccessibleFault',
 'NetworkTypeNotSupported',
 'OptionGroupAlreadyExistsFault',
 'OptionGroupNotFoundFault',
 'OptionGroupQuotaExceededFault',
 'PointInTimeRestoreNotEnabledFault',
 'ProvisionedIopsNotAvailableInAZFault',
 'ReservedDBInstanceAlreadyExistsFault',
 'ReservedDBInstanceNotFoundFault',
 'ReservedDBInstanceQuotaExceededFault',
 'ReservedDBInstancesOfferingNotFoundFault',
 'ResourceNotFoundFault',
 'SNSInvalidTopicFault',
 'SNSNoAuthorizationFault',
 'SNSTopicArnNotFoundFault',
 'SharedSnapshotQuotaExceededFault',
 'SnapshotQuotaExceededFault',
 'SourceClusterNotSupportedFault',
 'SourceDatabaseNotSupportedFault',
 'SourceNotFoundFault',
 'StorageQuotaExceededFault',
 'StorageTypeNotAvailableFault',
 'StorageTypeNotSupportedFault',
 'SubnetAlreadyInUse',
 'SubscriptionAlreadyExistFault',
 'SubscriptionCategoryNotFoundFault',
 'SubscriptionNotFoundFault',

Going to go ahead and change all the logic over!

AetherUnbound · 2023-11-03T19:10:30Z

Crap, GitHub outage is affecting the build steps 😅 Anyone mind re-running them when you have a moment?

vincbeck · 2023-11-03T19:17:12Z

I just rebased it, which will re-run the tests. Though I am not sure the outage is over.

AetherUnbound · 2023-11-03T19:36:29Z

Weird, it looks like a few tests are failing because they're raising a ClientError instead of the actual exception O_o

FAILED tests/providers/amazon/aws/sensors/test_rds.py::TestRdsExportTaskExistenceSensor::test_export_task_poke_false - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier my-db-instance-snap-export is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_export_task_state_not_found - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier does_not_exist is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_event_subscription_state_not_found - botocore.exceptions.ClientError: An error occurred (SubscriptionNotFoundFault) when calling the DescribeEventSubscriptions operation: Subscription does_not_exist not found.

vincbeck · 2023-11-03T19:47:39Z

Weird, it looks like a few tests are failing because they're raising a ClientError instead of the actual exception O_o

FAILED tests/providers/amazon/aws/sensors/test_rds.py::TestRdsExportTaskExistenceSensor::test_export_task_poke_false - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier my-db-instance-snap-export is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_export_task_state_not_found - botocore.exceptions.ClientError: An error occurred (ExportTaskNotFoundFault) when calling the DescribeExportTasks operation: Cannot cancel export task because a task with the identifier does_not_exist is not exist.
FAILED tests/providers/amazon/aws/hooks/test_rds.py::TestRdsHook::test_get_event_subscription_state_not_found - botocore.exceptions.ClientError: An error occurred (SubscriptionNotFoundFault) when calling the DescribeEventSubscriptions operation: Subscription does_not_exist not found.

Yes, this is expected from the documentation:

But service exceptions are wrapped into ClientError. See documentation

AetherUnbound · 2023-11-03T20:25:24Z

That's so odd, especially since local testing (shown in the issue description) raises a specific error in some cases 😅 Ah well, I'll change those pieces back.
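A rough sketch of why a specific `except` clause can miss these errors, using simplified stand-in classes rather than the real botocore ones (the class shapes and the `classify` helper here are assumptions made for illustration): a modeled fault is a subclass of ClientError, so it is caught either way, but a plain ClientError that merely carries the fault's code is only caught by the generic clause.

```python
class ClientError(Exception):
    """Simplified stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"An error occurred ({code}): {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


class ExportTaskNotFoundFault(ClientError):
    """Stand-in for a modeled service exception (a ClientError subclass)."""


def classify(exc: ClientError) -> str:
    """Report which except clause ends up handling the given error."""
    try:
        raise exc
    except ExportTaskNotFoundFault:
        return "caught specific fault"
    except ClientError as e:
        # A generic ClientError can only be recognized by its error code.
        if e.response["Error"]["Code"] == "ExportTaskNotFoundFault":
            return "caught generic ClientError, matched by code"
        return "unhandled"


print(classify(ExportTaskNotFoundFault("ExportTaskNotFound")))
print(classify(ClientError("ExportTaskNotFoundFault")))
```

The second call mirrors the failing tests above, where the error arrives as a plain botocore.exceptions.ClientError with the code ExportTaskNotFoundFault, so keeping the string check for those two functions is the safer option.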

AetherUnbound · 2023-11-03T22:26:03Z

Hmm, that docs build failure seems unrelated 🤔

potiuk · 2023-11-03T23:47:12Z

Hmm, that docs build failure seems unrelated 🤔

Yep. Rebased after:

a) we fixed it in main
b) proposed a PR #35424 that will prevent similar errors from getting merged to main in the future

vincbeck · 2023-11-06T15:59:44Z

Thanks for the work @AetherUnbound 🥳

AetherUnbound requested review from eladkal and o-nikolas as code owners October 5, 2023 04:08

boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Oct 5, 2023

AetherUnbound mentioned this pull request Oct 5, 2023

Increase timeout while waiting for RDS database rename WordPress/openverse#2961

Closed

utkarsharma2 reviewed Oct 5, 2023


vincbeck approved these changes Oct 5, 2023


hussein-awala reviewed Oct 5, 2023


AetherUnbound added 3 commits November 4, 2023 00:45

Fix AWS RDS hook's DB instance state check

746efe7

Replace exception checking string with specific exception match

d583ff1

Revert the two exceptions that don't raise specific faults

5b0daec

potiuk force-pushed the bugfix/aws-rds-db-not-found branch from e4ae104 to 5b0daec Compare November 3, 2023 23:46

vincbeck merged commit f24e519 into apache:main Nov 6, 2023
45 checks passed

AetherUnbound deleted the bugfix/aws-rds-db-not-found branch November 6, 2023 18:40

eladkal mentioned this pull request Nov 8, 2023

Status of testing Providers that were prepared on November 08, 2023 #35540

Closed

60 tasks

Taragolis mentioned this pull request Nov 9, 2023

RdsDeleteDbInstanceOperator sometimes does not complete when run with deferrable=True #35563

Closed

2 tasks

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Nov 10, 2023

Fix AWS RDS hook's DB instance state check (apache#34773)

a667c91

ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Nov 20, 2023

ephraimbuddy added this to the Airflow 2.8.0 milestone Nov 20, 2023
